linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
@ 2015-05-07 14:15 Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 01/11] perf: export perf_event_overflow Robert Bragg
                   ` (12 more replies)
  0 siblings, 13 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

This is an updated series adding support for an "i915_oa" perf PMU for
configuring the Intel Gen graphics Observability unit (Haswell only to
start with) and forwarding periodic counter reports as perf samples.

Compared to the series I sent out last year:

The driver is now hooked into context switching so we no longer require
a context to be pinned for the full lifetime of a perf event fd (when
profiling a specific context).

Not visible in the series, but I can say we've also gained some
experience from looking at enabling Broadwell within the same
architecture. There are some fiddly challenges ahead with enabling
Broadwell, but I do feel comfortable that it can be supported in the
same kind of way via perf. I haven't updated my Broadwell branches for a
little while now, but if anyone is interested I can share this code as a
point of reference.

I've had interest from folks looking at developing tools based on this
interface who would prefer not to require root. Since we are following
the precedent of not exposing system wide metrics to unprivileged
processes, I've added a sysctl directly comparable to
kernel.perf_event_paranoid (dev.i915.oa_event_paranoid) that lets users
optionally allow unprivileged access to system wide gpu metrics.

This series is able to expose more than just the A (aggregating)
counters and demonstrates selecting more counters that are useful when
benchmarking 3D render workloads. The expectation is to later add
further configurations geared towards, for example, media or gpgpu
workloads.

I've changed the uapi for configuring the i915_oa specific attributes
when calling perf_event_open(2) whereby instead of cramming lots of
bitfields into the perf_event_attr config members, I'm now
daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
config member that's extensible and validated in the same way as the
perf_event_attr struct. I've found this much nicer to work with while
being neatly extensible too.

I've made a few more (small) changes to core perf infrastructure:

I've added a PERF_EVENT_IOC_FLUSH ioctl that can be used to explicitly
ask the driver to flush buffered samples. In our case this makes sure to
forward all reports currently in the gpu-mapped, circular OA buffer as
perf samples. This issue was discussed a bit on LKML last year;
previously I was overloading our PMU's read() hook, but decided that the
cleaner approach would be to add a dedicated ioctl instead.

To allow device-driver PMUs to define their own types for records
written to the perf circular buffer, I've introduced a PERF_RECORD_DEVICE
type whereby drivers can then document their own header defining a
driver-specific scheme for sub-types. This is then used in the i915_oa
driver for reporting hw status conditions such as OA buffer overruns or
report-lost conditions.


For examples of using the i915_oa driver I have a branch of Mesa that
enables support for the INTEL_performance_query extension based on this:

https://github.com/rib/drm   wip/rib/oa-hsw-4.0.0
https://github.com/rib/mesa  wip/rib/oa-hsw-4.0.0

For reference I sent out a corresponding series for the Mesa work for
review yesterday:

http://lists.freedesktop.org/archives/mesa-dev/2015-May/083519.html

I also have a minimal gputop tool that can either test Mesa's
INTEL_performance_query implementation to view per-context metrics, or
view system wide gpu metrics collected directly from perf
(gputop/gputop-perf.c would be the main code of interest):

https://github.com/rib/gputop

If it's convenient for testing, my kernel patches can also be fetched
from here:

https://github.com/rib/linux  wip/rib/oa-hsw-4.0.0

One specific patch comment:

[RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA

I didn't want to hold up getting feedback due to this issue that I'm
currently investigating (since the effect on the driver should be
trivial), but I've included a work-in-progress patch because it does
address a known problem with collecting reliable periodic metrics.

Besides the last patch, I feel like this series is in pretty good shape
now. Having tested it with Mesa and several profiling tools, as well as
having started the work to enable Broadwell, I'm quite happy with our
approach of leveraging perf infrastructure.

I'd really appreciate any feedback on the core perf changes I've made,
as well as general feedback on the PMU driver itself.

Since it's been quite a long time since I last sent out patches for this
work, in case it's helpful to refer back to some of the discussion last
year:

https://lkml.org/lkml/2014/10/22/462

For anyone interested in more details about this hardware, this PRM is a
good starting point:

https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf

Kind regards,
- Robert

Robert Bragg (11):
  perf: export perf_event_overflow
  perf: Add PERF_PMU_CAP_IS_DEVICE flag
  perf: Add PERF_EVENT_IOC_FLUSH ioctl
  perf: Add a PERF_RECORD_DEVICE event type
  perf: allow drivers more control over event logging
  drm/i915: rename OACONTROL GEN7_OACONTROL
  drm/i915: Expose PMU for Observation Architecture
  drm/i915: add OA config for 3D render counters
  drm/i915: Add dev.i915.oa_event_paranoid sysctl option
  drm/i915: report OA buf overrun + report lost status
  WIP: drm/i915: constrain unit gating while using OA

 drivers/gpu/drm/i915/Makefile           |   1 +
 drivers/gpu/drm/i915/i915_cmd_parser.c  |   4 +-
 drivers/gpu/drm/i915/i915_dma.c         |   6 +
 drivers/gpu/drm/i915/i915_drv.h         |  62 +++
 drivers/gpu/drm/i915/i915_gem_context.c |  45 +-
 drivers/gpu/drm/i915/i915_oa_perf.c     | 951 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         | 311 ++++++++++-
 include/linux/perf_event.h              |  15 +
 include/uapi/drm/i915_drm.h             |  58 ++
 include/uapi/linux/perf_event.h         |  14 +
 kernel/events/core.c                    |  47 +-
 kernel/events/internal.h                |   9 -
 kernel/events/ring_buffer.c             |   3 +
 13 files changed, 1498 insertions(+), 28 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

-- 
2.3.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH 01/11] perf: export perf_event_overflow
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 02/11] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

This exports perf_event_overflow to support pmu drivers in loadable
modules, such as the i915 driver.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 kernel/events/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2fabc06..38c240c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5851,6 +5851,7 @@ int perf_event_overflow(struct perf_event *event,
 {
 	return __perf_event_overflow(event, 1, data, regs);
 }
+EXPORT_SYMBOL_GPL(perf_event_overflow);
 
 /*
  * Generic software event infrastructure
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 02/11] perf: Add PERF_PMU_CAP_IS_DEVICE flag
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 01/11] perf: export perf_event_overflow Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl Robert Bragg
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

The PERF_PMU_CAP_IS_DEVICE flag provides pmu drivers with a way to
declare that they only monitor device specific metrics; since they don't
monitor any cpu metrics, perf should bypass any cpu centric security
checks, as well as disallow cpu centric attributes.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 39 +++++++++++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2b62198..1af35b4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -166,6 +166,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_IS_DEVICE			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 38c240c..7218b01 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3330,7 +3330,8 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (!(pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) &&
+		    perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);
 
 		/*
@@ -7475,11 +7476,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
 
-	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-			return -EACCES;
-	}
-
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
@@ -7538,6 +7534,37 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_cpus;
 	}
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) {
+
+		/* Don't allow cpu centric attributes... */
+		if (event->attr.exclude_user ||
+		    event->attr.exclude_callchain_user ||
+		    event->attr.exclude_kernel ||
+		    event->attr.exclude_callchain_kernel ||
+		    event->attr.exclude_hv ||
+		    event->attr.exclude_idle ||
+		    event->attr.exclude_host ||
+		    event->attr.exclude_guest ||
+		    event->attr.mmap ||
+		    event->attr.comm ||
+		    event->attr.task)
+			return -EINVAL;
+
+		if (attr.sample_type &
+		    (PERF_SAMPLE_IP |
+		     PERF_SAMPLE_TID |
+		     PERF_SAMPLE_ADDR |
+		     PERF_SAMPLE_CALLCHAIN |
+		     PERF_SAMPLE_CPU |
+		     PERF_SAMPLE_BRANCH_STACK |
+		     PERF_SAMPLE_REGS_USER |
+		     PERF_SAMPLE_STACK_USER))
+			return -EINVAL;
+	} else if (!attr.exclude_kernel) {
+		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+	}
+
 	if (flags & PERF_FLAG_PID_CGROUP) {
 		err = perf_cgroup_connect(pid, event, &attr, group_leader);
 		if (err) {
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 01/11] perf: export perf_event_overflow Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 02/11] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:20   ` [Intel-gfx] " Chris Wilson
  2015-05-07 14:15 ` [RFC PATCH 04/11] perf: Add a PERF_RECORD_DEVICE event type Robert Bragg
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

To allow for pmus that may have internal buffering (e.g. where the
hardware itself writes out data to its own circular buffer, which is
only periodically forwarded to userspace via perf), this ioctl enables
userspace to explicitly ensure it has received all samples before a
given point in time.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h      | 7 +++++++
 include/uapi/linux/perf_event.h | 1 +
 kernel/events/core.c            | 5 +++++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1af35b4..69a0cb9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -266,6 +266,13 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Flush buffered samples (e.g. for pmu hardware that writes samples to
+	 * some intermediate buffer); userspace may need to explicitly ensure
+	 * such samples have been forwarded to perf.
+	 */
+	void (*flush)			(struct perf_event *event); /* optional */
 };
 
 /**
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9b79abb..a25967b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -360,6 +360,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_FLUSH		_IO ('$', 8)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7218b01..340deaa 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3981,6 +3981,11 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	case PERF_EVENT_IOC_SET_FILTER:
 		return perf_event_set_filter(event, (void __user *)arg);
 
+	case PERF_EVENT_IOC_FLUSH:
+		if (event->pmu->flush)
+			event->pmu->flush(event);
+		return 0;
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 04/11] perf: Add a PERF_RECORD_DEVICE event type
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (2 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 05/11] perf: allow drivers more control over event logging Robert Bragg
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

To allow for more extensible, device specific perf record types, this
adds a generic PERF_RECORD_DEVICE type that can be used by device
drivers. Driver developers can then document some driver-specific header
to further detail such a record's type.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/uapi/linux/perf_event.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a25967b..688e192 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -726,6 +726,19 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * The DEVICE record implies some device driver specific record that
+	 * will have some further mechanism for describing the contents of
+	 * the record (i.e. some driver specific event header).
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *
+	 *	struct DEVICE_EVENT_HEADER	device_header;
+	 * };
+	 */
+	PERF_RECORD_DEVICE			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 05/11] perf: allow drivers more control over event logging
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (3 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 04/11] perf: Add a PERF_RECORD_DEVICE event type Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 06/11] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

This exports enough api to allow drivers to output their own
PERF_RECORD_DEVICE events.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h  | 7 +++++++
 kernel/events/core.c        | 2 ++
 kernel/events/internal.h    | 9 ---------
 kernel/events/ring_buffer.c | 3 +++
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 69a0cb9..293f041 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -633,6 +633,10 @@ struct perf_sample_data {
 		    PERF_MEM_S(LOCK, NA)  |\
 		    PERF_MEM_S(TLB, NA))
 
+extern void perf_event_header__init_id(struct perf_event_header *header,
+				       struct perf_sample_data *data,
+				       struct perf_event *event);
+
 static inline void perf_sample_data_init(struct perf_sample_data *data,
 					 u64 addr, u64 period)
 {
@@ -654,6 +658,9 @@ extern void perf_prepare_sample(struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event,
 				struct pt_regs *regs);
+extern void perf_event__output_id_sample(struct perf_event *event,
+					 struct perf_output_handle *handle,
+					 struct perf_sample_data *sample);
 
 extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 340deaa..26b84fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4793,6 +4793,7 @@ void perf_event_header__init_id(struct perf_event_header *header,
 	if (event->attr.sample_id_all)
 		__perf_event_header__init_id(header, data, event);
 }
+EXPORT_SYMBOL_GPL(perf_event_header__init_id);
 
 static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 					   struct perf_sample_data *data)
@@ -4825,6 +4826,7 @@ void perf_event__output_id_sample(struct perf_event *event,
 	if (event->attr.sample_id_all)
 		__perf_event__output_id_sample(handle, sample);
 }
+EXPORT_SYMBOL_GPL(perf_event__output_id_sample);
 
 static void perf_output_read_one(struct perf_output_handle *handle,
 				 struct perf_event *event,
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..3c86bb3 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -44,15 +44,6 @@ extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
 
-extern void
-perf_event_header__init_id(struct perf_event_header *header,
-			   struct perf_sample_data *data,
-			   struct perf_event *event);
-extern void
-perf_event__output_id_sample(struct perf_event *event,
-			     struct perf_output_handle *handle,
-			     struct perf_sample_data *sample);
-
 extern struct page *
 perf_mmap_to_page(struct ring_buffer *rb, unsigned long pgoff);
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index eadb95c..fa100d4 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -202,12 +202,14 @@ out:
 
 	return -ENOSPC;
 }
+EXPORT_SYMBOL_GPL(perf_output_begin);
 
 unsigned int perf_output_copy(struct perf_output_handle *handle,
 		      const void *buf, unsigned int len)
 {
 	return __output_copy(handle, buf, len);
 }
+EXPORT_SYMBOL_GPL(perf_output_copy);
 
 unsigned int perf_output_skip(struct perf_output_handle *handle,
 			      unsigned int len)
@@ -220,6 +222,7 @@ void perf_output_end(struct perf_output_handle *handle)
 	perf_output_put_handle(handle);
 	rcu_read_unlock();
 }
+EXPORT_SYMBOL_GPL(perf_output_end);
 
 static void
 ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 06/11] drm/i915: rename OACONTROL GEN7_OACONTROL
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (4 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 05/11] perf: allow drivers more control over event logging Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture Robert Bragg
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

OACONTROL changes quite a bit for gen8, with some bits split out into a
per-context OACTXCONTROL register.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_cmd_parser.c | 4 ++--
 drivers/gpu/drm/i915/i915_reg.h        | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 9605ff8..f7ef20c 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -417,7 +417,7 @@ static const u32 gen7_render_regs[] = {
 	REG64(CL_PRIMITIVES_COUNT),
 	REG64(PS_INVOCATION_COUNT),
 	REG64(PS_DEPTH_COUNT),
-	OACONTROL, /* Only allowed for LRI and SRM. See below. */
+	GEN7_OACONTROL, /* Only allowed for LRI and SRM. See below. */
 	REG64(MI_PREDICATE_SRC0),
 	REG64(MI_PREDICATE_SRC1),
 	GEN7_3DPRIM_END_OFFSET,
@@ -961,7 +961,7 @@ static bool check_cmd(const struct intel_engine_cs *ring,
 		 * that will be written to the register. Hence, limit
 		 * OACONTROL writes to only MI_LOAD_REGISTER_IMM commands.
 		 */
-		if (reg_addr == OACONTROL) {
+		if (reg_addr == GEN7_OACONTROL) {
 			if (desc->cmd.value == MI_LOAD_REGISTER_MEM) {
 				DRM_DEBUG_DRIVER("CMD: Rejected LRM to OACONTROL\n");
 				return false;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index d47afbc..2fa1669 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -515,7 +515,7 @@
 #define GEN7_3DPRIM_START_INSTANCE      0x243C
 #define GEN7_3DPRIM_BASE_VERTEX         0x2440
 
-#define OACONTROL 0x2360
+#define GEN7_OACONTROL 0x2360
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (5 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 06/11] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:36   ` [Intel-gfx] " Chris Wilson
  2015-05-07 14:58   ` Chris Wilson
  2015-05-07 14:15 ` [RFC PATCH 08/11] drm/i915: add OA config for 3D render counters Robert Bragg
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

Gen graphics hardware can be set up to periodically write snapshots of
performance counters into a circular buffer and this patch exposes that
capability to userspace via the perf interface.

To start with, this only enables the A (aggregating) counters with the
simplest configuration requirements.

Only Haswell is supported currently.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/Makefile           |   1 +
 drivers/gpu/drm/i915/i915_dma.c         |   6 +
 drivers/gpu/drm/i915/i915_drv.h         |  53 +++
 drivers/gpu/drm/i915/i915_gem_context.c |  45 +-
 drivers/gpu/drm/i915/i915_oa_perf.c     | 750 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  68 +++
 include/uapi/drm/i915_drm.h             |  29 ++
 7 files changed, 942 insertions(+), 10 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index a69002e..8af9f4a 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -16,6 +16,7 @@ i915-y := i915_drv.o \
 
 i915-$(CONFIG_COMPAT)   += i915_ioc32.o
 i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o
+i915-$(CONFIG_PERF_EVENTS) += i915_oa_perf.o
 
 # GEM code
 i915-y += i915_cmd_parser.o \
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index e44116f..1de6639 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -817,6 +817,11 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	mutex_init(&dev_priv->dpio_lock);
 	mutex_init(&dev_priv->modeset_restore_lock);
 
+	/* Must at least be registered before trying to pin any context
+	 * otherwise i915_oa_context_pin_notify() will lock an un-initialized
+	 * spinlock, upsetting lockdep checks */
+	i915_oa_pmu_register(dev);
+
 	intel_pm_setup(dev);
 
 	intel_display_crc_init(dev);
@@ -1062,6 +1067,7 @@ int i915_driver_unload(struct drm_device *dev)
 		return ret;
 	}
 
+	i915_oa_pmu_unregister(dev);
 	intel_power_domains_fini(dev_priv);
 
 	intel_gpu_ips_teardown();
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f5020a8..3d8e4ed 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -49,6 +49,7 @@
 #include <linux/hashtable.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 
 /* General customization:
@@ -1816,6 +1817,35 @@ struct drm_i915_private {
 	 */
 	struct workqueue_struct *dp_wq;
 
+#ifdef CONFIG_PERF_EVENTS
+	struct {
+		struct pmu pmu;
+		spinlock_t lock;
+		struct hrtimer timer;
+		struct pt_regs dummy_regs;
+
+		struct perf_event *exclusive_event;
+		struct intel_context *specific_ctx;
+		bool event_active;
+
+		bool periodic;
+		u32 period_exponent;
+
+		u32 metrics_set;
+
+		struct {
+			struct drm_i915_gem_object *obj;
+			u32 gtt_offset;
+			u8 *addr;
+			u32 head;
+			u32 tail;
+			int format;
+			int format_size;
+			spinlock_t flush_lock;
+		} oa_buffer;
+	} oa_pmu;
+#endif
+
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
 	struct {
 		int (*execbuf_submit)(struct drm_device *dev, struct drm_file *file,
@@ -2975,6 +3005,20 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file_priv);
 
+#ifdef CONFIG_PERF_EVENTS
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context);
+void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+				  struct intel_context *context);
+#else
+static inline void
+i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+			   struct intel_context *context) {}
+static inline void
+i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+			     struct intel_context *context) {}
+#endif
+
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
 					  struct i915_address_space *vm,
@@ -3084,6 +3128,15 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
 		    u32 batch_len,
 		    bool is_master);
 
+/* i915_oa_perf.c */
+#ifdef CONFIG_PERF_EVENTS
+extern void i915_oa_pmu_register(struct drm_device *dev);
+extern void i915_oa_pmu_unregister(struct drm_device *dev);
+#else
+static inline void i915_oa_pmu_register(struct drm_device *dev) {}
+static inline void i915_oa_pmu_unregister(struct drm_device *dev) {}
+#endif
+
 /* i915_suspend.c */
 extern int i915_save_state(struct drm_device *dev);
 extern int i915_restore_state(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index dd5bff6..3624435 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -133,6 +133,33 @@ static int get_context_size(struct drm_device *dev)
 	return ret;
 }
 
+static int i915_gem_context_pin_state(struct drm_device *dev,
+				      struct intel_context *ctx)
+{
+	int ret;
+
+	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
+
+	ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
+				    get_context_alignment(dev), 0);
+	if (ret)
+		return ret;
+
+	i915_oa_context_pin_notify(dev->dev_private, ctx);
+
+	return 0;
+}
+
+static void i915_gem_context_unpin_state(struct drm_device *dev,
+					 struct intel_context *ctx)
+{
+	/* Ensure that we stop the OA unit referencing the context *before*
+	 * actually unpinning the ctx */
+	i915_oa_context_unpin_notify(dev->dev_private, ctx);
+
+	i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
+}
+
 void i915_gem_context_free(struct kref *ctx_ref)
 {
 	struct intel_context *ctx = container_of(ctx_ref,
@@ -260,8 +287,7 @@ i915_gem_create_context(struct drm_device *dev,
 		 * be available. To avoid this we always pin the default
 		 * context.
 		 */
-		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(dev), 0);
+		ret = i915_gem_context_pin_state(dev, ctx);
 		if (ret) {
 			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
 			goto err_destroy;
@@ -287,7 +313,7 @@ i915_gem_create_context(struct drm_device *dev,
 
 err_unpin:
 	if (is_global_default_ctx && ctx->legacy_hw_ctx.rcs_state)
-		i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(dev, ctx);
 err_destroy:
 	i915_gem_context_unreference(ctx);
 	return ERR_PTR(ret);
@@ -314,7 +340,7 @@ void i915_gem_context_reset(struct drm_device *dev)
 
 		if (lctx) {
 			if (lctx->legacy_hw_ctx.rcs_state && i == RCS)
-				i915_gem_object_ggtt_unpin(lctx->legacy_hw_ctx.rcs_state);
+				i915_gem_context_unpin_state(dev, lctx);
 
 			i915_gem_context_unreference(lctx);
 			ring->last_context = NULL;
@@ -388,12 +414,12 @@ void i915_gem_context_fini(struct drm_device *dev)
 		if (dev_priv->ring[RCS].last_context == dctx) {
 			/* Fake switch to NULL context */
 			WARN_ON(dctx->legacy_hw_ctx.rcs_state->active);
-			i915_gem_object_ggtt_unpin(dctx->legacy_hw_ctx.rcs_state);
+			i915_gem_context_unpin_state(dev, dctx);
 			i915_gem_context_unreference(dctx);
 			dev_priv->ring[RCS].last_context = NULL;
 		}
 
-		i915_gem_object_ggtt_unpin(dctx->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(dev, dctx);
 	}
 
 	for (i = 0; i < I915_NUM_RINGS; i++) {
@@ -650,8 +676,7 @@ static int do_switch(struct intel_engine_cs *ring,
 
 	/* Trying to pin first makes error handling easier. */
 	if (ring == &dev_priv->ring[RCS]) {
-		ret = i915_gem_obj_ggtt_pin(to->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(ring->dev), 0);
+		ret = i915_gem_context_pin_state(ring->dev, to);
 		if (ret)
 			return ret;
 	}
@@ -763,7 +788,7 @@ static int do_switch(struct intel_engine_cs *ring,
 			from->legacy_hw_ctx.rcs_state->last_read_req) != ring);
 
 		/* obj is kept alive until the next request by its active ref */
-		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(ring->dev, from);
 		i915_gem_context_unreference(from);
 	}
 
@@ -786,7 +811,7 @@ done:
 
 unpin_out:
 	if (ring->id == RCS)
-		i915_gem_object_ggtt_unpin(to->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(ring->dev, to);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
new file mode 100644
index 0000000..a1cf55f
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -0,0 +1,750 @@
+#include <linux/perf_event.h>
+#include <linux/sizes.h>
+
+#include "i915_drv.h"
+#include "intel_ringbuffer.h"
+
+/* Must be a power of two */
+#define OA_BUFFER_SIZE	     SZ_16M
+#define OA_TAKEN(tail, head) ((tail - head) & (OA_BUFFER_SIZE - 1))
+
+#define FREQUENCY 200
+#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
+
+static int hsw_perf_format_sizes[] = {
+	64,  /* A13_HSW */
+	128, /* A29_HSW */
+	128, /* A13_B8_C8_HSW */
+	-1,  /* Disallowed since 192 bytes doesn't divide evenly into the
+		OA buffer size (A29_B8_C8_HSW) */
+	64,  /* B4_C8_HSW */
+	256, /* A45_B8_C8_HSW */
+	128, /* B4_C8_A16_HSW */
+	64   /* C4_B8_HSW */
+};
+
+static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
+					     u8 *snapshot,
+					     struct perf_event *event)
+{
+	struct perf_sample_data data;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	struct perf_raw_record raw;
+
+	WARN_ON(snapshot_size == 0);
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	/* Note: the combined u32 raw->size member + raw data itself must be 8
+	 * byte aligned. (See note in init_oa_buffer for more details) */
+	raw.size = snapshot_size + 4;
+	raw.data = snapshot;
+
+	data.raw = &raw;
+
+	perf_event_overflow(event, &data, &dev_priv->oa_pmu.dummy_regs);
+}
+
+static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv,
+				u32 head,
+				u32 tail)
+{
+	struct perf_event *exclusive_event = dev_priv->oa_pmu.exclusive_event;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	u8 *oa_buf_base = dev_priv->oa_pmu.oa_buffer.addr;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	u8 *snapshot;
+	u32 taken;
+
+	head -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+	tail -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+
+	/* Note: the gpu doesn't wrap the tail according to the OA buffer size
+	 * so when we need to make sure our head/tail values are in-bounds we
+	 * use the above mask.
+	 */
+
+	while ((taken = OA_TAKEN(tail, head))) {
+		/* The tail increases in 64 byte increments, not in
+		 * format_size steps. */
+		if (taken < snapshot_size)
+			break;
+
+		snapshot = oa_buf_base + (head & mask);
+		head += snapshot_size;
+
+		/* We currently only allow exclusive access to the counters
+		 * so there's only one event to forward to... */
+		if (dev_priv->oa_pmu.event_active)
+			forward_one_oa_snapshot_to_event(dev_priv, snapshot,
+							 exclusive_event);
+	}
+
+	return dev_priv->oa_pmu.oa_buffer.gtt_offset + head;
+}
+
+static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
+			       bool skip_if_flushing)
+{
+	unsigned long flags;
+	u32 oastatus2;
+	u32 oastatus1;
+	u32 head;
+	u32 tail;
+
+	/* Can either flush via hrtimer callback or pmu methods/fops */
+	if (skip_if_flushing) {
+
+		/* If the hrtimer triggers at the same time that we are
+		 * responding to a userspace initiated flush then we can
+		 * just bail out...
+		 */
+		if (!spin_trylock_irqsave(&dev_priv->oa_pmu.oa_buffer.flush_lock,
+					  flags))
+			return;
+	} else
+		spin_lock_irqsave(&dev_priv->oa_pmu.oa_buffer.flush_lock, flags);
+
+	WARN_ON(!dev_priv->oa_pmu.oa_buffer.addr);
+
+	oastatus2 = I915_READ(GEN7_OASTATUS2);
+	oastatus1 = I915_READ(GEN7_OASTATUS1);
+
+	head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
+	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+
+	if (oastatus1 & (GEN7_OASTATUS1_OABUFFER_OVERFLOW |
+			 GEN7_OASTATUS1_REPORT_LOST)) {
+
+		/* XXX: How can we convey report-lost errors to userspace?  It
+		 * doesn't look like perf's _REPORT_LOST mechanism is
+		 * appropriate in this case; that's just for cases where we
+		 * run out of space for samples in the perf circular buffer.
+		 *
+		 * Maybe we can claim a special report-id and use that to
+		 * forward status flags?
+		 */
+		pr_debug("OA buffer read error: addr = %p, head = %u, offset = %u, tail = %u cnt o'flow = %d, buf o'flow = %d, rpt lost = %d\n",
+			 dev_priv->oa_pmu.oa_buffer.addr,
+			 head,
+			 head - dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 tail,
+			 oastatus1 & GEN7_OASTATUS1_COUNTER_OVERFLOW ? 1 : 0,
+			 oastatus1 & GEN7_OASTATUS1_OABUFFER_OVERFLOW ? 1 : 0,
+			 oastatus1 & GEN7_OASTATUS1_REPORT_LOST ? 1 : 0);
+
+		I915_WRITE(GEN7_OASTATUS1, oastatus1 &
+			   ~(GEN7_OASTATUS1_OABUFFER_OVERFLOW |
+			     GEN7_OASTATUS1_REPORT_LOST));
+	}
+
+	head = forward_oa_snapshots(dev_priv, head, tail);
+
+	I915_WRITE(GEN7_OASTATUS2, (head & GEN7_OASTATUS2_HEAD_MASK) |
+				    GEN7_OASTATUS2_GGTT);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.oa_buffer.flush_lock, flags);
+}
+
+static void
+oa_buffer_destroy(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->oa_pmu.oa_buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_buffer.obj);
+	drm_gem_object_unreference(&i915->oa_pmu.oa_buffer.obj->base);
+
+	i915->oa_pmu.oa_buffer.obj = NULL;
+	i915->oa_pmu.oa_buffer.gtt_offset = 0;
+	i915->oa_pmu.oa_buffer.addr = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
+}
+
+static void i915_oa_event_destroy(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	WARN_ON(event->parent);
+
+	oa_buffer_destroy(i915);
+
+	i915->oa_pmu.specific_ctx = NULL;
+
+	BUG_ON(i915->oa_pmu.exclusive_event != event);
+	i915->oa_pmu.exclusive_event = NULL;
+
+	intel_uncore_forcewake_put(i915, FORCEWAKE_ALL);
+	intel_runtime_pm_put(i915);
+}
+
+static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
+{
+	int i;
+	void *addr = NULL;
+	struct sg_page_iter sg_iter;
+	struct page **pages;
+
+	pages = drm_malloc_ab(obj->base.size >> PAGE_SHIFT, sizeof(*pages));
+	if (pages == NULL) {
+		DRM_DEBUG_DRIVER("Failed to get space for pages\n");
+		goto finish;
+	}
+
+	i = 0;
+	for_each_sg_page(obj->pages->sgl, &sg_iter, obj->pages->nents, 0) {
+		pages[i] = sg_page_iter_page(&sg_iter);
+		i++;
+	}
+
+	addr = vmap(pages, i, 0, PAGE_KERNEL);
+	if (addr == NULL) {
+		DRM_DEBUG_DRIVER("Failed to vmap pages\n");
+		goto finish;
+	}
+
+finish:
+	if (pages)
+		drm_free_large(pages);
+	return addr;
+}
+
+static int init_oa_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev));
+	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
+	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
+
+	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
+
+	/* NB: We over-allocate the OA buffer due to the way raw sample data
+	 * gets copied from the gpu mapped circular buffer into the perf
+	 * circular buffer so that only one copy is required.
+	 *
+	 * For each perf sample (raw->size + 4) needs to be 8 byte aligned,
+	 * where the 4 corresponds to the 32bit raw->size member that's
+	 * added to the sample header that userspace sees.
+	 *
+	 * Due to the + 4 for the size member: when we copy a report to the
+	 * userspace facing perf buffer we always copy an additional 4 bytes
+	 * from the subsequent report to make up for the misalignment, but
+	 * when a report is at the end of the gpu mapped buffer we need to
+	 * read 4 bytes past the end of the buffer.
+	 */
+	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
+	if (bo == NULL) {
+		DRM_ERROR("Failed to allocate OA buffer\n");
+		ret = -ENOMEM;
+		goto err;
+	}
+	dev_priv->oa_pmu.oa_buffer.obj = bo;
+
+	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
+	if (ret)
+		goto err_unref;
+
+	/* PreHSW required 512K alignment, HSW requires 16M */
+	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
+	if (ret)
+		goto err_unref;
+
+	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
+	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
+
+	/* Pre-DevBDW: OABUFFER must be set with counters off,
+	 * before OASTATUS1, but after OASTATUS2 */
+	I915_WRITE(GEN7_OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   GEN7_OASTATUS2_GGTT); /* head */
+	I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
+	I915_WRITE(GEN7_OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   GEN7_OASTATUS1_OABUFFER_SIZE_16M); /* tail */
+
+	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
+			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 dev_priv->oa_pmu.oa_buffer.addr);
+
+	return 0;
+
+err_unref:
+	drm_gem_object_unreference_unlocked(&bo->base);
+err:
+	return ret;
+}
+
+static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, typeof(*i915), oa_pmu.timer);
+
+	flush_oa_snapshots(i915, true);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
+static struct intel_context *
+lookup_context(struct drm_i915_private *dev_priv,
+	       struct file *user_filp,
+	       u32 ctx_user_handle)
+{
+	struct intel_context *ctx;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+		struct drm_file *drm_file;
+
+		if (!ctx->file_priv)
+			continue;
+
+		drm_file = ctx->file_priv->file;
+
+		if (user_filp->private_data == drm_file &&
+		    ctx->user_handle == ctx_user_handle) {
+			mutex_unlock(&dev_priv->dev->struct_mutex);
+			return ctx;
+		}
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	return NULL;
+}
+
+static int i915_oa_copy_attr(drm_i915_oa_attr_t __user *uattr,
+			     drm_i915_oa_attr_t *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, I915_OA_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * Zero the full structure, so that a short copy leaves the rest cleared.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (size < I915_OA_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we don't know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+	if (attr->__reserved_1)
+		return -EINVAL;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
+static int i915_oa_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	drm_i915_oa_attr_t oa_attr;
+	u64 report_format;
+	int ret = 0;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	ret = i915_oa_copy_attr(to_user_ptr(event->attr.config), &oa_attr);
+	if (ret)
+		return ret;
+
+	/* To avoid the complexity of having to accurately filter
+	 * counter snapshots and marshal them to the appropriate client
+	 * we currently only allow exclusive access */
+	if (dev_priv->oa_pmu.oa_buffer.obj)
+		return -EBUSY;
+
+	report_format = oa_attr.format;
+	dev_priv->oa_pmu.oa_buffer.format = report_format;
+	dev_priv->oa_pmu.metrics_set = oa_attr.metrics_set;
+
+	if (IS_HASWELL(dev_priv->dev)) {
+		int snapshot_size;
+
+		if (report_format >= ARRAY_SIZE(hsw_perf_format_sizes))
+			return -EINVAL;
+
+		snapshot_size = hsw_perf_format_sizes[report_format];
+		if (snapshot_size < 0)
+			return -EINVAL;
+
+		dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
+
+		if (oa_attr.metrics_set > I915_OA_METRICS_SET_3D)
+			return -EINVAL;
+	} else {
+		BUG(); /* pmu shouldn't have been registered */
+		return -ENODEV;
+	}
+
+	/* Since we are limited to an exponential scale for
+	 * programming the OA sampling period we don't allow userspace
+	 * to pass a precise attr.sample_period. */
+	if (event->attr.freq ||
+	    (event->attr.sample_period != 0 &&
+	     event->attr.sample_period != 1))
+		return -EINVAL;
+
+	dev_priv->oa_pmu.periodic = event->attr.sample_period;
+
+	/* Instead of allowing userspace to configure the period via
+	 * attr.sample_period we instead accept an exponent whereby
+	 * the sample_period will be:
+	 *
+	 *   80ns * 2^(period_exponent + 1)
+	 *
+	 * Programming a period of 160 nanoseconds would not be very
+	 * polite, so higher frequencies are reserved for root.
+	 */
+	if (dev_priv->oa_pmu.periodic) {
+		u64 period_exponent = oa_attr.timer_exponent;
+
+		if (period_exponent > 63)
+			return -EINVAL;
+
+		if (period_exponent < 15 && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		dev_priv->oa_pmu.period_exponent = period_exponent;
+	} else if (oa_attr.timer_exponent)
+		return -EINVAL;
+
+	/* We bypass the default perf core perf_paranoid_cpu() ||
+	 * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
+	 * flag and instead authenticate based on whether the current
+	 * pid owns the specified context, or require CAP_SYS_ADMIN
+	 * when collecting cross-context metrics.
+	 */
+	dev_priv->oa_pmu.specific_ctx = NULL;
+	if (oa_attr.single_context) {
+		u32 ctx_id = oa_attr.ctx_id;
+		unsigned int drm_fd = oa_attr.drm_fd;
+		struct fd fd = fdget(drm_fd);
+
+		if (fd.file) {
+			dev_priv->oa_pmu.specific_ctx =
+				lookup_context(dev_priv, fd.file, ctx_id);
+		}
+	}
+
+	if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+	ret = init_oa_buffer(event);
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	if (ret)
+		return ret;
+
+	BUG_ON(dev_priv->oa_pmu.exclusive_event);
+	dev_priv->oa_pmu.exclusive_event = event;
+
+	event->destroy = i915_oa_event_destroy;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, performance counter enable, note:
+	 *
+	 *   "When this bit is set, in order to have coherent counts,
+	 *   RC6 power state and trunk clock gating must be disabled.
+	 *   This can be achieved by programming MMIO registers as
+	 *   0xA094=0 and 0xA090[31]=1"
+	 *
+	 *   In our case we expect that taking pm + FORCEWAKE
+	 *   references will effectively disable RC6 and trunk clock
+	 *   gating.
+	 */
+	intel_runtime_pm_get(dev_priv);
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	return 0;
+}
+
+static void update_oacontrol(struct drm_i915_private *dev_priv)
+{
+	BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
+
+	if (dev_priv->oa_pmu.event_active) {
+		unsigned long ctx_id = 0;
+		bool pinning_ok = false;
+
+		if (dev_priv->oa_pmu.specific_ctx) {
+			struct intel_context *ctx =
+				dev_priv->oa_pmu.specific_ctx;
+			struct drm_i915_gem_object *obj =
+				ctx->legacy_hw_ctx.rcs_state;
+
+			if (i915_gem_obj_is_pinned(obj)) {
+				ctx_id = i915_gem_obj_ggtt_offset(obj);
+				pinning_ok = true;
+			}
+		}
+
+		if ((ctx_id == 0 || pinning_ok)) {
+			bool periodic = dev_priv->oa_pmu.periodic;
+			u32 period_exponent = dev_priv->oa_pmu.period_exponent;
+			u32 report_format = dev_priv->oa_pmu.oa_buffer.format;
+
+			I915_WRITE(GEN7_OACONTROL,
+				   (ctx_id & GEN7_OACONTROL_CTX_MASK) |
+				   (period_exponent <<
+				    GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
+				   (periodic ?
+				    GEN7_OACONTROL_TIMER_ENABLE : 0) |
+				   (report_format <<
+				    GEN7_OACONTROL_FORMAT_SHIFT) |
+				   (ctx_id ?
+				    GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
+				   GEN7_OACONTROL_ENABLE);
+			return;
+		}
+	}
+
+	I915_WRITE(GEN7_OACONTROL, 0);
+}
+
+static void i915_oa_event_start(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	unsigned long lock_flags;
+	u32 oastatus1, tail;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, specific context enable:
+	 *
+	 *   "OA unit level clock gating must be ENABLED when using
+	 *   specific ContextID feature."
+	 *
+	 * Assuming we don't ever disable OA unit level clock gating
+	 * let's just assert that this condition is met...
+	 */
+	WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE,
+		  "disabled OA unit level clock gating will result in incorrect per-context OA counters");
+
+	/* XXX: On Haswell, when threshold disable mode is desired,
+	 * instead of setting the threshold enable to '0', we need to
+	 * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0
+	 * (threshold value of 0)
+	 */
+	I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE |
+				  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG5, 0); /* threshold value */
+
+	I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE |
+				  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG1, 0); /* threshold value */
+
+	/* Setup B0 as the gpu clock counter... */
+	I915_WRITE(OACEC0_0, OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL); /* to 0 */
+	I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
+
+	dev_priv->oa_pmu.event_active = true;
+	update_oacontrol(dev_priv);
+
+	/* Reset the head ptr to ensure we don't forward reports relating
+	 * to a previous perf event */
+	oastatus1 = I915_READ(GEN7_OASTATUS1);
+	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+	I915_WRITE(GEN7_OASTATUS2, (tail & GEN7_OASTATUS2_HEAD_MASK) |
+				    GEN7_OASTATUS2_GGTT);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+
+	if (event->attr.sample_period)
+		__hrtimer_start_range_ns(&dev_priv->oa_pmu.timer,
+					 ns_to_ktime(PERIOD), 0,
+					 HRTIMER_MODE_REL_PINNED, 0);
+
+	event->hw.state = 0;
+}
+
+static void i915_oa_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	unsigned long lock_flags;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
+	dev_priv->oa_pmu.event_active = false;
+	update_oacontrol(dev_priv);
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+
+	if (event->attr.sample_period) {
+		hrtimer_cancel(&dev_priv->oa_pmu.timer);
+		flush_oa_snapshots(dev_priv, false);
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_oa_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_oa_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_oa_event_del(struct perf_event *event, int flags)
+{
+	i915_oa_event_stop(event, flags);
+}
+
+static void i915_oa_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static void i915_oa_event_flush(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	/* We want userspace to be able to use a read() to explicitly
+	 * flush OA counter snapshots... */
+	if (event->attr.sample_period)
+		flush_oa_snapshots(i915, true);
+}
+
+static int i915_oa_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context)
+{
+	unsigned long flags;
+
+	if (dev_priv->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags);
+
+	if (dev_priv->oa_pmu.specific_ctx == context)
+		update_oacontrol(dev_priv);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
+}
+
+void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+				  struct intel_context *context)
+{
+	unsigned long flags;
+
+	if (dev_priv->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags);
+
+	if (dev_priv->oa_pmu.specific_ctx == context)
+		update_oacontrol(dev_priv);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
+}
+
+void i915_oa_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (!IS_HASWELL(dev))
+		return;
+
+	/* We need to be careful about forwarding cpu metrics to
+	 * userspace considering that PERF_PMU_CAP_IS_DEVICE bypasses
+	 * the events/core security check that stops an unprivileged
+	 * process collecting metrics for other processes.
+	 */
+	i915->oa_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->oa_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->oa_pmu.timer.function = hrtimer_sample;
+
+	spin_lock_init(&i915->oa_pmu.lock);
+
+	i915->oa_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
+
+	/* Effectively disallow opening an event with a specific pid
+	 * since we aren't interested in processes running on the cpu...
+	 */
+	i915->oa_pmu.pmu.task_ctx_nr   = perf_invalid_context;
+
+	i915->oa_pmu.pmu.event_init    = i915_oa_event_init;
+	i915->oa_pmu.pmu.add	       = i915_oa_event_add;
+	i915->oa_pmu.pmu.del	       = i915_oa_event_del;
+	i915->oa_pmu.pmu.start	       = i915_oa_event_start;
+	i915->oa_pmu.pmu.stop	       = i915_oa_event_stop;
+	i915->oa_pmu.pmu.read	       = i915_oa_event_read;
+	i915->oa_pmu.pmu.flush	       = i915_oa_event_flush;
+	i915->oa_pmu.pmu.event_idx     = i915_oa_event_event_idx;
+
+	if (perf_pmu_register(&i915->oa_pmu.pmu, "i915_oa", -1))
+		i915->oa_pmu.pmu.event_init = NULL;
+}
+
+void i915_oa_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->oa_pmu.pmu);
+	i915->oa_pmu.pmu.event_init = NULL;
+}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 2fa1669..9e427cc 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -516,6 +516,73 @@
 #define GEN7_3DPRIM_BASE_VERTEX         0x2440
 
 #define GEN7_OACONTROL 0x2360
+#define  GEN7_OACONTROL_CTX_MASK	    0xFFFFF000
+#define  GEN7_OACONTROL_TIMER_PERIOD_MASK   0x3F
+#define  GEN7_OACONTROL_TIMER_PERIOD_SHIFT  6
+#define  GEN7_OACONTROL_TIMER_ENABLE	    (1<<5)
+#define  GEN7_OACONTROL_FORMAT_A13	    (0<<2)
+#define  GEN7_OACONTROL_FORMAT_A29	    (1<<2)
+#define  GEN7_OACONTROL_FORMAT_A13_B8_C8    (2<<2)
+#define  GEN7_OACONTROL_FORMAT_A29_B8_C8    (3<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8	    (4<<2)
+#define  GEN7_OACONTROL_FORMAT_A45_B8_C8    (5<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8_A16    (6<<2)
+#define  GEN7_OACONTROL_FORMAT_C4_B8	    (7<<2)
+#define  GEN7_OACONTROL_FORMAT_SHIFT	    2
+#define  GEN7_OACONTROL_PER_CTX_ENABLE	    (1<<1)
+#define  GEN7_OACONTROL_ENABLE		    (1<<0)
+
+#define OASTARTTRIG5 0x02720
+#define  OASTARTTRIG5_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG6 0x02724
+#define  OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE (1<<23)
+#define  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OASTARTTRIG1 0x02710
+#define  OASTARTTRIG1_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG2 0x02714
+#define  OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE (1<<23)
+#define  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OACEC0_0 0x2770
+#define  OACEC0_0_B0_COMPARE_ANY_EQUAL		0
+#define  OACEC0_0_B0_COMPARE_OR			0
+#define  OACEC0_0_B0_COMPARE_GREATER_THAN	1
+#define  OACEC0_0_B0_COMPARE_EQUAL		2
+#define  OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL	3
+#define  OACEC0_0_B0_COMPARE_LESS_THAN		4
+#define  OACEC0_0_B0_COMPARE_NOT_EQUAL		5
+#define  OACEC0_0_B0_COMPARE_LESS_OR_EQUAL	6
+#define  OACEC0_0_B0_COMPARE_VALUE_MASK		0xffff
+#define  OACEC0_0_B0_COMPARE_VALUE_SHIFT	3
+
+#define OACEC0_1 0x2774
+
+#define GEN7_OABUFFER 0x23B0 /* R/W */
+#define  GEN7_OABUFFER_OVERRUN_DISABLE	    (1<<3)
+#define  GEN7_OABUFFER_EDGE_TRIGGER	    (1<<2)
+#define  GEN7_OABUFFER_STOP_RESUME_ENABLE   (1<<1)
+#define  GEN7_OABUFFER_RESUME		    (1<<0)
+
+#define GEN7_OASTATUS1 0x2364
+#define  GEN7_OASTATUS1_TAIL_MASK	    0xffffffc0
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_128K  (0<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_256K  (1<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_512K  (2<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_1M    (3<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_2M    (4<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_4M    (5<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_8M    (6<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_16M   (7<<3)
+#define  GEN7_OASTATUS1_COUNTER_OVERFLOW    (1<<2)
+#define  GEN7_OASTATUS1_OABUFFER_OVERFLOW   (1<<1)
+#define  GEN7_OASTATUS1_REPORT_LOST	    (1<<0)
+
+#define GEN7_OASTATUS2 0x2368
+#define GEN7_OASTATUS2_HEAD_MASK    0xffffffc0
+#define GEN7_OASTATUS2_GGTT	    0x1
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
@@ -6497,6 +6564,7 @@ enum skl_disp_power_wells {
 # define GEN6_RCCUNIT_CLOCK_GATE_DISABLE		(1 << 11)
 
 #define GEN6_UCGCTL3				0x9408
+# define GEN6_OACSUNIT_CLOCK_GATE_DISABLE		(1 << 20)
 
 #define GEN7_UCGCTL4				0x940c
 #define  GEN7_L3BANK2X_CLOCK_GATE_DISABLE	(1<<25)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4851d66..f78f232 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -58,6 +58,35 @@
 #define I915_ERROR_UEVENT		"ERROR"
 #define I915_RESET_UEVENT		"RESET"
 
+/**
+ * DOC: perf events configuration exposed by i915 through /sys/bus/event_sources/drivers/i915_oa
+ *
+ */
+
+#define I915_OA_FORMAT_A13_HSW		0
+#define I915_OA_FORMAT_A29_HSW		1
+#define I915_OA_FORMAT_A13_B8_C8_HSW	2
+#define I915_OA_FORMAT_B4_C8_HSW	4
+#define I915_OA_FORMAT_A45_B8_C8_HSW	5
+#define I915_OA_FORMAT_B4_C8_A16_HSW	6
+#define I915_OA_FORMAT_C4_B8_HSW	7
+
+#define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
+
+typedef struct _drm_i915_oa_attr {
+	__u32 size;
+
+	__u32 format;
+	__u32 metrics_set;
+	__u32 timer_exponent;
+
+	__u32 drm_fd;
+	__u32 ctx_id;
+
+	__u64 single_context : 1,
+	      __reserved_1 : 63;
+} drm_i915_oa_attr_t;
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
2.3.2



* [RFC PATCH 08/11] drm/i915: add OA config for 3D render counters
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (6 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 09/11] drm/i915: Add dev.i915.oa_event_paranoid sysctl option Robert Bragg
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

This enables access to some additional counters beyond the aggregating A
counters, adding a '3D' metric set configuration useful while profiling
3D rendering workloads.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h     |   7 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 124 +++++++++++++--
 drivers/gpu/drm/i915/i915_reg.h     | 294 ++++++++++++++++++++++++++++++++----
 include/uapi/drm/i915_drm.h         |   2 +
 4 files changed, 385 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3d8e4ed..1e65dc2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1564,6 +1564,13 @@ struct i915_virtual_gpu {
 	bool active;
 };
 
+#ifdef CONFIG_PERF_EVENTS
+struct i915_oa_reg {
+	u32 addr;
+	u32 value;
+};
+#endif
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index a1cf55f..d0e144c 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -23,6 +23,80 @@ static int hsw_perf_format_sizes[] = {
 	64   /* C4_B8_HSW */
 };
 
+
+/* A generated mux config to select counters useful for profiling 3D
+ * workloads */
+static struct i915_oa_reg hsw_profile_3d_mux_config[] = {
+
+	{ 0x253A4, 0x01600000 },
+	{ 0x25440, 0x00100000 },
+	{ 0x25128, 0x00000000 },
+	{ 0x2691C, 0x00000800 },
+	{ 0x26AA0, 0x01500000 },
+	{ 0x26B9C, 0x00006000 },
+	{ 0x2791C, 0x00000800 },
+	{ 0x27AA0, 0x01500000 },
+	{ 0x27B9C, 0x00006000 },
+	{ 0x2641C, 0x00000400 },
+	{ 0x25380, 0x00000010 },
+	{ 0x2538C, 0x00000000 },
+	{ 0x25384, 0x0800AAAA },
+	{ 0x25400, 0x00000004 },
+	{ 0x2540C, 0x06029000 },
+	{ 0x25410, 0x00000002 },
+	{ 0x25404, 0x5C30FFFF },
+	{ 0x25100, 0x00000016 },
+	{ 0x25110, 0x00000400 },
+	{ 0x25104, 0x00000000 },
+	{ 0x26804, 0x00001211 },
+	{ 0x26884, 0x00000100 },
+	{ 0x26900, 0x00000002 },
+	{ 0x26908, 0x00700000 },
+	{ 0x26904, 0x00000000 },
+	{ 0x26984, 0x00001022 },
+	{ 0x26A04, 0x00000011 },
+	{ 0x26A80, 0x00000006 },
+	{ 0x26A88, 0x00000C02 },
+	{ 0x26A84, 0x00000000 },
+	{ 0x26B04, 0x00001000 },
+	{ 0x26B80, 0x00000002 },
+	{ 0x26B8C, 0x00000007 },
+	{ 0x26B84, 0x00000000 },
+	{ 0x27804, 0x00004844 },
+	{ 0x27884, 0x00000400 },
+	{ 0x27900, 0x00000002 },
+	{ 0x27908, 0x0E000000 },
+	{ 0x27904, 0x00000000 },
+	{ 0x27984, 0x00004088 },
+	{ 0x27A04, 0x00000044 },
+	{ 0x27A80, 0x00000006 },
+	{ 0x27A88, 0x00018040 },
+	{ 0x27A84, 0x00000000 },
+	{ 0x27B04, 0x00004000 },
+	{ 0x27B80, 0x00000002 },
+	{ 0x27B8C, 0x000000E0 },
+	{ 0x27B84, 0x00000000 },
+	{ 0x26104, 0x00002222 },
+	{ 0x26184, 0x0C006666 },
+	{ 0x26284, 0x04000000 },
+	{ 0x26304, 0x04000000 },
+	{ 0x26400, 0x00000002 },
+	{ 0x26410, 0x000000A0 },
+	{ 0x26404, 0x00000000 },
+	{ 0x25420, 0x04108020 },
+	{ 0x25424, 0x1284A420 },
+	{ 0x2541C, 0x00000000 },
+	{ 0x25428, 0x00042049 },
+};
+
+/* A corresponding B counter configuration for profiling 3D workloads */
+static struct i915_oa_reg hsw_profile_3d_b_counter_config[] = {
+	{ 0x2724, 0x00800000 },
+	{ 0x2720, 0x00000000 },
+	{ 0x2714, 0x00800000 },
+	{ 0x2710, 0x00000000 },
+};
+
 static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
 					     u8 *snapshot,
 					     struct perf_event *event)
@@ -551,6 +625,19 @@ static void update_oacontrol(struct drm_i915_private *dev_priv)
 	I915_WRITE(GEN7_OACONTROL, 0);
 }
 
+static void config_oa_regs(struct drm_i915_private *dev_priv,
+			   struct i915_oa_reg *regs,
+			   int n_regs)
+{
+	int i;
+
+	for (i = 0; i < n_regs; i++) {
+		struct i915_oa_reg *reg = regs + i;
+
+		I915_WRITE(reg->addr, reg->value);
+	}
+}
+
 static void i915_oa_event_start(struct perf_event *event, int flags)
 {
 	struct drm_i915_private *dev_priv =
@@ -571,22 +658,31 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
 	WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE,
 		  "disabled OA unit level clock gating will result in incorrect per-context OA counters");
 
-	/* XXX: On Haswell, when threshold disable mode is desired,
-	 * instead of setting the threshold enable to '0', we need to
-	 * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0
-	 * (threshold value of 0)
-	 */
-	I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE |
-				  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE));
-	I915_WRITE(OASTARTTRIG5, 0); /* threshold value */
+	I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
 
-	I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE |
-				  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE));
-	I915_WRITE(OASTARTTRIG1, 0); /* threshold value */
+	if (dev_priv->oa_pmu.metrics_set == I915_OA_METRICS_SET_3D) {
+		config_oa_regs(dev_priv, hsw_profile_3d_mux_config,
+			       ARRAY_SIZE(hsw_profile_3d_mux_config));
+		config_oa_regs(dev_priv, hsw_profile_3d_b_counter_config,
+			       ARRAY_SIZE(hsw_profile_3d_b_counter_config));
+	} else {
+		/* XXX: On Haswell, when threshold disable mode is desired,
+		 * instead of setting the threshold enable to '0', we need to
+		 * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0
+		 * (threshold value of 0)
+		 */
+		I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_THRESHOLD_ENABLE |
+					  OASTARTTRIG6_EVENT_SELECT_4));
+		I915_WRITE(OASTARTTRIG5, 0); /* threshold value */
 
-	/* Setup B0 as the gpu clock counter... */
-	I915_WRITE(OACEC0_0, OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL); /* to 0 */
-	I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+		I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_THRESHOLD_ENABLE |
+					  OASTARTTRIG2_EVENT_SELECT_0));
+		I915_WRITE(OASTARTTRIG1, 0); /* threshold value */
+
+		/* Setup B0 as the gpu clock counter... */
+		I915_WRITE(OACEC0_0, OACEC_COMPARE_GREATER_OR_EQUAL); /* to 0 */
+		I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+	}
 
 	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
 
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9e427cc..d94932a 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -532,34 +532,6 @@
 #define  GEN7_OACONTROL_PER_CTX_ENABLE	    (1<<1)
 #define  GEN7_OACONTROL_ENABLE		    (1<<0)
 
-#define OASTARTTRIG5 0x02720
-#define  OASTARTTRIG5_THRESHOLD_VALUE_MASK	0xffff
-
-#define OASTARTTRIG6 0x02724
-#define  OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE (1<<23)
-#define  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE	(1<<28)
-
-#define OASTARTTRIG1 0x02710
-#define  OASTARTTRIG1_THRESHOLD_VALUE_MASK	0xffff
-
-#define OASTARTTRIG2 0x02714
-#define  OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE (1<<23)
-#define  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE	(1<<28)
-
-#define OACEC0_0 0x2770
-#define  OACEC0_0_B0_COMPARE_ANY_EQUAL		0
-#define  OACEC0_0_B0_COMPARE_OR			0
-#define  OACEC0_0_B0_COMPARE_GREATER_THAN	1
-#define  OACEC0_0_B0_COMPARE_EQUAL		2
-#define  OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL	3
-#define  OACEC0_0_B0_COMPARE_LESS_THAN		4
-#define  OACEC0_0_B0_COMPARE_NOT_EQUAL		5
-#define  OACEC0_0_B0_COMPARE_LESS_OR_EQUAL	6
-#define  OACEC0_0_B0_COMPARE_VALUE_MASK		0xffff
-#define  OACEC0_0_B0_COMPARE_VALUE_SHIFT	3
-
-#define OACEC0_1 0x2774
-
 #define GEN7_OABUFFER 0x23B0 /* R/W */
 #define  GEN7_OABUFFER_OVERRUN_DISABLE	    (1<<3)
 #define  GEN7_OABUFFER_EDGE_TRIGGER	    (1<<2)
@@ -584,6 +556,272 @@
 #define GEN7_OASTATUS2_HEAD_MASK    0xffffffc0
 #define GEN7_OASTATUS2_GGTT	    0x1
 
+#define GDT_CHICKEN_BITS    0x9840
+#define GT_NOA_ENABLE	    0x00000080
+
+/*
+ * OA Boolean state
+ */
+
+#define OAREPORTTRIG1 0x2740
+#define OAREPORTTRIG1_THRESHOLD_MASK 0xffff
+#define OAREPORTTRIG1_EDGE_LEVEL_TRIGGER_SELECT_MASK 0xffff0000 /* 0=level */
+
+#define OAREPORTTRIG2 0x2744
+#define OAREPORTTRIG2_INVERT_A_0  (1<<0)
+#define OAREPORTTRIG2_INVERT_A_1  (1<<1)
+#define OAREPORTTRIG2_INVERT_A_2  (1<<2)
+#define OAREPORTTRIG2_INVERT_A_3  (1<<3)
+#define OAREPORTTRIG2_INVERT_A_4  (1<<4)
+#define OAREPORTTRIG2_INVERT_A_5  (1<<5)
+#define OAREPORTTRIG2_INVERT_A_6  (1<<6)
+#define OAREPORTTRIG2_INVERT_A_7  (1<<7)
+#define OAREPORTTRIG2_INVERT_A_8  (1<<8)
+#define OAREPORTTRIG2_INVERT_A_9  (1<<9)
+#define OAREPORTTRIG2_INVERT_A_10 (1<<10)
+#define OAREPORTTRIG2_INVERT_A_11 (1<<11)
+#define OAREPORTTRIG2_INVERT_A_12 (1<<12)
+#define OAREPORTTRIG2_INVERT_A_13 (1<<13)
+#define OAREPORTTRIG2_INVERT_A_14 (1<<14)
+#define OAREPORTTRIG2_INVERT_A_15 (1<<15)
+#define OAREPORTTRIG2_INVERT_B_0  (1<<16)
+#define OAREPORTTRIG2_INVERT_B_1  (1<<17)
+#define OAREPORTTRIG2_INVERT_B_2  (1<<18)
+#define OAREPORTTRIG2_INVERT_B_3  (1<<19)
+#define OAREPORTTRIG2_INVERT_C_0  (1<<20)
+#define OAREPORTTRIG2_INVERT_C_1  (1<<21)
+#define OAREPORTTRIG2_INVERT_D_0  (1<<22)
+#define OAREPORTTRIG2_THRESHOLD_ENABLE	    (1<<23)
+#define OAREPORTTRIG2_REPORT_TRIGGER_ENABLE (1<<31)
+
+#define OAREPORTTRIG3 0x2748
+#define OAREPORTTRIG3_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG3_NOA_SELECT_8_SHIFT    0
+#define OAREPORTTRIG3_NOA_SELECT_9_SHIFT    4
+#define OAREPORTTRIG3_NOA_SELECT_10_SHIFT   8
+#define OAREPORTTRIG3_NOA_SELECT_11_SHIFT   12
+#define OAREPORTTRIG3_NOA_SELECT_12_SHIFT   16
+#define OAREPORTTRIG3_NOA_SELECT_13_SHIFT   20
+#define OAREPORTTRIG3_NOA_SELECT_14_SHIFT   24
+#define OAREPORTTRIG3_NOA_SELECT_15_SHIFT   28
+
+#define OAREPORTTRIG4 0x274c
+#define OAREPORTTRIG4_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG4_NOA_SELECT_0_SHIFT    0
+#define OAREPORTTRIG4_NOA_SELECT_1_SHIFT    4
+#define OAREPORTTRIG4_NOA_SELECT_2_SHIFT    8
+#define OAREPORTTRIG4_NOA_SELECT_3_SHIFT    12
+#define OAREPORTTRIG4_NOA_SELECT_4_SHIFT    16
+#define OAREPORTTRIG4_NOA_SELECT_5_SHIFT    20
+#define OAREPORTTRIG4_NOA_SELECT_6_SHIFT    24
+#define OAREPORTTRIG4_NOA_SELECT_7_SHIFT    28
+
+#define OAREPORTTRIG5 0x2750
+#define OAREPORTTRIG5_THRESHOLD_MASK 0xffff
+#define OAREPORTTRIG5_EDGE_LEVEL_TRIGGER_SELECT_MASK 0xffff0000 /* 0=level */
+
+#define OAREPORTTRIG6 0x2754
+#define OAREPORTTRIG6_INVERT_A_0  (1<<0)
+#define OAREPORTTRIG6_INVERT_A_1  (1<<1)
+#define OAREPORTTRIG6_INVERT_A_2  (1<<2)
+#define OAREPORTTRIG6_INVERT_A_3  (1<<3)
+#define OAREPORTTRIG6_INVERT_A_4  (1<<4)
+#define OAREPORTTRIG6_INVERT_A_5  (1<<5)
+#define OAREPORTTRIG6_INVERT_A_6  (1<<6)
+#define OAREPORTTRIG6_INVERT_A_7  (1<<7)
+#define OAREPORTTRIG6_INVERT_A_8  (1<<8)
+#define OAREPORTTRIG6_INVERT_A_9  (1<<9)
+#define OAREPORTTRIG6_INVERT_A_10 (1<<10)
+#define OAREPORTTRIG6_INVERT_A_11 (1<<11)
+#define OAREPORTTRIG6_INVERT_A_12 (1<<12)
+#define OAREPORTTRIG6_INVERT_A_13 (1<<13)
+#define OAREPORTTRIG6_INVERT_A_14 (1<<14)
+#define OAREPORTTRIG6_INVERT_A_15 (1<<15)
+#define OAREPORTTRIG6_INVERT_B_0  (1<<16)
+#define OAREPORTTRIG6_INVERT_B_1  (1<<17)
+#define OAREPORTTRIG6_INVERT_B_2  (1<<18)
+#define OAREPORTTRIG6_INVERT_B_3  (1<<19)
+#define OAREPORTTRIG6_INVERT_C_0  (1<<20)
+#define OAREPORTTRIG6_INVERT_C_1  (1<<21)
+#define OAREPORTTRIG6_INVERT_D_0  (1<<22)
+#define OAREPORTTRIG6_THRESHOLD_ENABLE	    (1<<23)
+#define OAREPORTTRIG6_REPORT_TRIGGER_ENABLE (1<<31)
+
+#define OAREPORTTRIG7 0x2758
+#define OAREPORTTRIG7_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG7_NOA_SELECT_8_SHIFT    0
+#define OAREPORTTRIG7_NOA_SELECT_9_SHIFT    4
+#define OAREPORTTRIG7_NOA_SELECT_10_SHIFT   8
+#define OAREPORTTRIG7_NOA_SELECT_11_SHIFT   12
+#define OAREPORTTRIG7_NOA_SELECT_12_SHIFT   16
+#define OAREPORTTRIG7_NOA_SELECT_13_SHIFT   20
+#define OAREPORTTRIG7_NOA_SELECT_14_SHIFT   24
+#define OAREPORTTRIG7_NOA_SELECT_15_SHIFT   28
+
+#define OAREPORTTRIG8 0x275c
+#define OAREPORTTRIG8_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG8_NOA_SELECT_0_SHIFT    0
+#define OAREPORTTRIG8_NOA_SELECT_1_SHIFT    4
+#define OAREPORTTRIG8_NOA_SELECT_2_SHIFT    8
+#define OAREPORTTRIG8_NOA_SELECT_3_SHIFT    12
+#define OAREPORTTRIG8_NOA_SELECT_4_SHIFT    16
+#define OAREPORTTRIG8_NOA_SELECT_5_SHIFT    20
+#define OAREPORTTRIG8_NOA_SELECT_6_SHIFT    24
+#define OAREPORTTRIG8_NOA_SELECT_7_SHIFT    28
+
+#define OASTARTTRIG1 0x2710
+#define OASTARTTRIG1_THRESHOLD_COUNT_MASK_MBZ 0xffff0000
+#define OASTARTTRIG1_THRESHOLD_MASK	      0xffff
+
+#define OASTARTTRIG2 0x2714
+#define OASTARTTRIG2_INVERT_A_0 (1<<0)
+#define OASTARTTRIG2_INVERT_A_1 (1<<1)
+#define OASTARTTRIG2_INVERT_A_2 (1<<2)
+#define OASTARTTRIG2_INVERT_A_3 (1<<3)
+#define OASTARTTRIG2_INVERT_A_4 (1<<4)
+#define OASTARTTRIG2_INVERT_A_5 (1<<5)
+#define OASTARTTRIG2_INVERT_A_6 (1<<6)
+#define OASTARTTRIG2_INVERT_A_7 (1<<7)
+#define OASTARTTRIG2_INVERT_A_8 (1<<8)
+#define OASTARTTRIG2_INVERT_A_9 (1<<9)
+#define OASTARTTRIG2_INVERT_A_10 (1<<10)
+#define OASTARTTRIG2_INVERT_A_11 (1<<11)
+#define OASTARTTRIG2_INVERT_A_12 (1<<12)
+#define OASTARTTRIG2_INVERT_A_13 (1<<13)
+#define OASTARTTRIG2_INVERT_A_14 (1<<14)
+#define OASTARTTRIG2_INVERT_A_15 (1<<15)
+#define OASTARTTRIG2_INVERT_B_0 (1<<16)
+#define OASTARTTRIG2_INVERT_B_1 (1<<17)
+#define OASTARTTRIG2_INVERT_B_2 (1<<18)
+#define OASTARTTRIG2_INVERT_B_3 (1<<19)
+#define OASTARTTRIG2_INVERT_C_0 (1<<20)
+#define OASTARTTRIG2_INVERT_C_1 (1<<21)
+#define OASTARTTRIG2_INVERT_D_0 (1<<22)
+#define OASTARTTRIG2_THRESHOLD_ENABLE	    (1<<23)
+#define OASTARTTRIG2_START_TRIG_FLAG_MBZ    (1<<24)
+#define OASTARTTRIG2_EVENT_SELECT_0  (1<<28)
+#define OASTARTTRIG2_EVENT_SELECT_1  (1<<29)
+#define OASTARTTRIG2_EVENT_SELECT_2  (1<<30)
+#define OASTARTTRIG2_EVENT_SELECT_3  (1<<31)
+
+#define OASTARTTRIG3 0x2718
+#define OASTARTTRIG3_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG3_NOA_SELECT_8_SHIFT    0
+#define OASTARTTRIG3_NOA_SELECT_9_SHIFT    4
+#define OASTARTTRIG3_NOA_SELECT_10_SHIFT   8
+#define OASTARTTRIG3_NOA_SELECT_11_SHIFT   12
+#define OASTARTTRIG3_NOA_SELECT_12_SHIFT   16
+#define OASTARTTRIG3_NOA_SELECT_13_SHIFT   20
+#define OASTARTTRIG3_NOA_SELECT_14_SHIFT   24
+#define OASTARTTRIG3_NOA_SELECT_15_SHIFT   28
+
+#define OASTARTTRIG4 0x271c
+#define OASTARTTRIG4_NOA_SELECT_MASK	    0xf
+#define OASTARTTRIG4_NOA_SELECT_0_SHIFT    0
+#define OASTARTTRIG4_NOA_SELECT_1_SHIFT    4
+#define OASTARTTRIG4_NOA_SELECT_2_SHIFT    8
+#define OASTARTTRIG4_NOA_SELECT_3_SHIFT    12
+#define OASTARTTRIG4_NOA_SELECT_4_SHIFT    16
+#define OASTARTTRIG4_NOA_SELECT_5_SHIFT    20
+#define OASTARTTRIG4_NOA_SELECT_6_SHIFT    24
+#define OASTARTTRIG4_NOA_SELECT_7_SHIFT    28
+
+#define OASTARTTRIG5 0x2720
+#define OASTARTTRIG5_THRESHOLD_COUNT_MASK_MBZ 0xffff0000
+#define OASTARTTRIG5_THRESHOLD_MASK	      0xffff
+
+#define OASTARTTRIG6 0x2724
+#define OASTARTTRIG6_INVERT_A_0 (1<<0)
+#define OASTARTTRIG6_INVERT_A_1 (1<<1)
+#define OASTARTTRIG6_INVERT_A_2 (1<<2)
+#define OASTARTTRIG6_INVERT_A_3 (1<<3)
+#define OASTARTTRIG6_INVERT_A_4 (1<<4)
+#define OASTARTTRIG6_INVERT_A_5 (1<<5)
+#define OASTARTTRIG6_INVERT_A_6 (1<<6)
+#define OASTARTTRIG6_INVERT_A_7 (1<<7)
+#define OASTARTTRIG6_INVERT_A_8 (1<<8)
+#define OASTARTTRIG6_INVERT_A_9 (1<<9)
+#define OASTARTTRIG6_INVERT_A_10 (1<<10)
+#define OASTARTTRIG6_INVERT_A_11 (1<<11)
+#define OASTARTTRIG6_INVERT_A_12 (1<<12)
+#define OASTARTTRIG6_INVERT_A_13 (1<<13)
+#define OASTARTTRIG6_INVERT_A_14 (1<<14)
+#define OASTARTTRIG6_INVERT_A_15 (1<<15)
+#define OASTARTTRIG6_INVERT_B_0 (1<<16)
+#define OASTARTTRIG6_INVERT_B_1 (1<<17)
+#define OASTARTTRIG6_INVERT_B_2 (1<<18)
+#define OASTARTTRIG6_INVERT_B_3 (1<<19)
+#define OASTARTTRIG6_INVERT_C_0 (1<<20)
+#define OASTARTTRIG6_INVERT_C_1 (1<<21)
+#define OASTARTTRIG6_INVERT_D_0 (1<<22)
+#define OASTARTTRIG6_THRESHOLD_ENABLE	    (1<<23)
+#define OASTARTTRIG6_START_TRIG_FLAG_MBZ    (1<<24)
+#define OASTARTTRIG6_EVENT_SELECT_4  (1<<28)
+#define OASTARTTRIG6_EVENT_SELECT_5  (1<<29)
+#define OASTARTTRIG6_EVENT_SELECT_6  (1<<30)
+#define OASTARTTRIG6_EVENT_SELECT_7  (1<<31)
+
+#define OASTARTTRIG7 0x2728
+#define OASTARTTRIG7_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG7_NOA_SELECT_8_SHIFT    0
+#define OASTARTTRIG7_NOA_SELECT_9_SHIFT    4
+#define OASTARTTRIG7_NOA_SELECT_10_SHIFT   8
+#define OASTARTTRIG7_NOA_SELECT_11_SHIFT   12
+#define OASTARTTRIG7_NOA_SELECT_12_SHIFT   16
+#define OASTARTTRIG7_NOA_SELECT_13_SHIFT   20
+#define OASTARTTRIG7_NOA_SELECT_14_SHIFT   24
+#define OASTARTTRIG7_NOA_SELECT_15_SHIFT   28
+
+#define OASTARTTRIG8 0x272c
+#define OASTARTTRIG8_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG8_NOA_SELECT_0_SHIFT    0
+#define OASTARTTRIG8_NOA_SELECT_1_SHIFT    4
+#define OASTARTTRIG8_NOA_SELECT_2_SHIFT    8
+#define OASTARTTRIG8_NOA_SELECT_3_SHIFT    12
+#define OASTARTTRIG8_NOA_SELECT_4_SHIFT    16
+#define OASTARTTRIG8_NOA_SELECT_5_SHIFT    20
+#define OASTARTTRIG8_NOA_SELECT_6_SHIFT    24
+#define OASTARTTRIG8_NOA_SELECT_7_SHIFT    28
+
+/* CECX_0 */
+#define OACEC_COMPARE_LESS_OR_EQUAL	6
+#define OACEC_COMPARE_NOT_EQUAL		5
+#define OACEC_COMPARE_LESS_THAN		4
+#define OACEC_COMPARE_GREATER_OR_EQUAL	3
+#define OACEC_COMPARE_EQUAL		2
+#define OACEC_COMPARE_GREATER_THAN	1
+#define OACEC_COMPARE_ANY_EQUAL		0
+
+#define OACEC_COMPARE_VALUE_MASK    0xffff
+#define OACEC_COMPARE_VALUE_SHIFT   3
+
+#define OACEC_SELECT_NOA	(0<<19)
+#define OACEC_SELECT_PREV	(1<<19)
+#define OACEC_SELECT_BOOLEAN	(2<<19)
+
+/* CECX_1 */
+#define OACEC_MASK_MASK		    0xffff
+#define OACEC_CONSIDERATIONS_MASK   0xffff
+#define OACEC_CONSIDERATIONS_SHIFT  16
+
+#define OACEC0_0 0x2770
+#define OACEC0_1 0x2774
+#define OACEC1_0 0x2778
+#define OACEC1_1 0x277c
+#define OACEC2_0 0x2780
+#define OACEC2_1 0x2784
+#define OACEC3_0 0x2788
+#define OACEC3_1 0x278c
+#define OACEC4_0 0x2790
+#define OACEC4_1 0x2794
+#define OACEC5_0 0x2798
+#define OACEC5_1 0x279c
+#define OACEC6_0 0x27a0
+#define OACEC6_1 0x27a4
+#define OACEC7_0 0x27a8
+#define OACEC7_1 0x27ac
+
+
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
 #define GEN7_PIPE_DE_LOAD_SL(pipe) _PIPE(pipe, \
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f78f232..7aa1d33 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -71,6 +71,8 @@
 #define I915_OA_FORMAT_B4_C8_A16_HSW	6
 #define I915_OA_FORMAT_C4_B8_HSW	7
 
+#define I915_OA_METRICS_SET_3D		1
+
 #define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
 
 typedef struct _drm_i915_oa_attr {
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 09/11] drm/i915: Add dev.i915.oa_event_paranoid sysctl option
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (7 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 08/11] drm/i915: add OA config for 3D render counters Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 10/11] drm/i915: report OA buf overrun + report lost status Robert Bragg
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

Consistent with the kernel.perf_event_paranoid sysctl option that can
allow non-root users to access system-wide CPU metrics, this can
optionally allow non-root users to access system-wide OA counter
metrics from Gen graphics hardware.
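
The gate this introduces mirrors perf_event_paranoid: a system-wide
(non context-specific) OA event is refused for unprivileged users while
the knob is set. A standalone sketch of the predicate (the function name
is illustrative, not part of the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative restatement of the check in i915_oa_event_init():
 * deny a system-wide OA event when no specific context was given,
 * the paranoid knob is set, and the caller lacks CAP_SYS_ADMIN. */
static bool oa_access_denied(bool has_specific_ctx, int paranoid,
			     bool cap_sys_admin)
{
	return !has_specific_ctx && paranoid && !cap_sys_admin;
}
```

Writing 0 to dev.i915.oa_event_paranoid clears the knob, and the check
falls through for unprivileged system-wide events.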

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 40 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1e65dc2..7bc7243 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1831,6 +1831,8 @@ struct drm_i915_private {
 		struct hrtimer timer;
 		struct pt_regs dummy_regs;
 
+		struct ctl_table_header *sysctl_header;
+
 		struct perf_event *exclusive_event;
 		struct intel_context *specific_ctx;
 		bool event_active;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index d0e144c..c3e5059 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -11,6 +11,8 @@
 #define FREQUENCY 200
 #define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
 
+static u32 i915_oa_event_paranoid = true;
+
 static int hsw_perf_format_sizes[] = {
 	64,  /* A13_HSW */
 	128, /* A29_HSW */
@@ -548,7 +550,8 @@ static int i915_oa_event_init(struct perf_event *event)
 		}
 	}
 
-	if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
+	if (!dev_priv->oa_pmu.specific_ctx &&
+	    i915_oa_event_paranoid && !capable(CAP_SYS_ADMIN))
 		return -EACCES;
 
 	mutex_lock(&dev_priv->dev->struct_mutex);
@@ -795,6 +798,37 @@ void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
 	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
 }
 
+static struct ctl_table oa_table[] = {
+	{
+	 .procname = "oa_event_paranoid",
+	 .data = &i915_oa_event_paranoid,
+	 .maxlen = sizeof(i915_oa_event_paranoid),
+	 .mode = 0644,
+	 .proc_handler = proc_dointvec,
+	 },
+	{}
+};
+
+static struct ctl_table i915_root[] = {
+	{
+	 .procname = "i915",
+	 .maxlen = 0,
+	 .mode = 0555,
+	 .child = oa_table,
+	 },
+	{}
+};
+
+static struct ctl_table dev_root[] = {
+	{
+	 .procname = "dev",
+	 .maxlen = 0,
+	 .mode = 0555,
+	 .child = i915_root,
+	 },
+	{}
+};
+
 void i915_oa_pmu_register(struct drm_device *dev)
 {
 	struct drm_i915_private *i915 = to_i915(dev);
@@ -802,6 +836,8 @@ void i915_oa_pmu_register(struct drm_device *dev)
 	if (!IS_HASWELL(dev))
 		return;
 
+	i915->oa_pmu.sysctl_header = register_sysctl_table(dev_root);
+
 	/* We need to be careful about forwarding cpu metrics to
 	 * userspace considering that PERF_PMU_CAP_IS_DEVICE bypasses
 	 * the events/core security check that stops an unprivileged
@@ -841,6 +877,8 @@ void i915_oa_pmu_unregister(struct drm_device *dev)
 	if (i915->oa_pmu.pmu.event_init == NULL)
 		return;
 
+	unregister_sysctl_table(i915->oa_pmu.sysctl_header);
+
 	perf_pmu_unregister(&i915->oa_pmu.pmu);
 	i915->oa_pmu.pmu.event_init = NULL;
 }
-- 
2.3.2



* [RFC PATCH 10/11] drm/i915: report OA buf overrun + report lost status
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (8 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 09/11] drm/i915: Add dev.i915.oa_event_paranoid sysctl option Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-07 14:15 ` [RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA Robert Bragg
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

This adds two driver-specific PERF_RECORD_DEVICE event types for
reporting the OA buffer overrun and report-lost status bits to
userspace.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_oa_perf.c | 53 ++++++++++++++++++++++++-------------
 include/uapi/drm/i915_drm.h         | 27 +++++++++++++++++++
 2 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index c3e5059..d0dad5d 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -159,6 +159,34 @@ static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv,
 	return dev_priv->oa_pmu.oa_buffer.gtt_offset + head;
 }
 
+static void log_oa_status(struct drm_i915_private *dev_priv,
+			  enum drm_i915_oa_event_type status)
+{
+	struct {
+		struct perf_event_header header;
+		drm_i915_oa_event_header_t i915_oa_header;
+	} oa_event;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample_data;
+	struct perf_event *event = dev_priv->oa_pmu.exclusive_event;
+	int ret;
+
+	oa_event.header.size = sizeof(oa_event);
+	oa_event.header.type = PERF_RECORD_DEVICE;
+	oa_event.i915_oa_header.type = status;
+	oa_event.i915_oa_header.__reserved_1 = 0;
+
+	perf_event_header__init_id(&oa_event.header, &sample_data, event);
+
+	ret = perf_output_begin(&handle, event, oa_event.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, oa_event);
+	perf_event__output_id_sample(event, &handle, &sample_data);
+	perf_output_end(&handle);
+}
+
 static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
 			       bool skip_if_flushing)
 {
@@ -189,25 +217,14 @@ static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
 	head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
 	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
 
-	if (oastatus1 & (GEN7_OASTATUS1_OABUFFER_OVERFLOW |
-			 GEN7_OASTATUS1_REPORT_LOST)) {
+	if (unlikely(oastatus1 & (GEN7_OASTATUS1_OABUFFER_OVERFLOW |
+				  GEN7_OASTATUS1_REPORT_LOST))) {
 
-		/* XXX: How can we convey report-lost errors to userspace?  It
-		 * doesn't look like perf's _REPORT_LOST mechanism is
-		 * appropriate in this case; that's just for cases where we
-		 * run out of space for samples in the perf circular buffer.
-		 *
-		 * Maybe we can claim a special report-id and use that to
-		 * forward status flags?
-		 */
-		pr_debug("OA buffer read error: addr = %p, head = %u, offset = %u, tail = %u cnt o'flow = %d, buf o'flow = %d, rpt lost = %d\n",
-			 dev_priv->oa_pmu.oa_buffer.addr,
-			 head,
-			 head - dev_priv->oa_pmu.oa_buffer.gtt_offset,
-			 tail,
-			 oastatus1 & GEN7_OASTATUS1_COUNTER_OVERFLOW ? 1 : 0,
-			 oastatus1 & GEN7_OASTATUS1_OABUFFER_OVERFLOW ? 1 : 0,
-			 oastatus1 & GEN7_OASTATUS1_REPORT_LOST ? 1 : 0);
+		if (oastatus1 & GEN7_OASTATUS1_OABUFFER_OVERFLOW)
+			log_oa_status(dev_priv, I915_OA_RECORD_BUFFER_OVERFLOW);
+
+		if (oastatus1 & GEN7_OASTATUS1_REPORT_LOST)
+			log_oa_status(dev_priv, I915_OA_RECORD_REPORT_LOST);
 
 		I915_WRITE(GEN7_OASTATUS1, oastatus1 &
 			   ~(GEN7_OASTATUS1_OABUFFER_OVERFLOW |
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7aa1d33..2871922 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -89,6 +89,33 @@ typedef struct _drm_i915_oa_attr {
 	      __reserved_1 : 63;
 } drm_i915_oa_attr_t;
 
+/* Header for PERF_RECORD_DEVICE type events */
+typedef struct _drm_i915_oa_event_header {
+	__u32 type;
+	__u32 __reserved_1;
+} drm_i915_oa_event_header_t;
+
+enum drm_i915_oa_event_type {
+
+	/*
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	drm_i915_oa_event_header_t	i915_oa_header;
+	 * };
+	 */
+	I915_OA_RECORD_BUFFER_OVERFLOW		= 1,
+
+	/*
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	drm_i915_oa_event_header_t	i915_oa_header;
+	 * };
+	 */
+	I915_OA_RECORD_REPORT_LOST		= 2,
+
+	I915_OA_RECORD_MAX,			/* non-ABI */
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
2.3.2
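
A userspace consumer would see these as PERF_RECORD_DEVICE records in
the perf mmap ring, with the i915 header immediately following the perf
header. A minimal decoding sketch (struct layouts mirrored from the
patch above; the PERF_RECORD_DEVICE type value and the helper names are
assumptions here, and id-sample trailers are ignored for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct perf_event_header_sketch {
	uint32_t type;
	uint16_t misc;
	uint16_t size;
};

struct i915_oa_event_header_sketch {
	uint32_t type;		/* I915_OA_RECORD_* */
	uint32_t reserved;
};

enum {
	I915_OA_RECORD_BUFFER_OVERFLOW = 1,
	I915_OA_RECORD_REPORT_LOST = 2,
};

/* Pull the i915 OA status type out of one raw record, or return 0
 * if the record is too short to carry one. */
static uint32_t decode_oa_status(const void *rec)
{
	struct perf_event_header_sketch hdr;
	struct i915_oa_event_header_sketch oa;

	memcpy(&hdr, rec, sizeof(hdr));
	if (hdr.size < sizeof(hdr) + sizeof(oa))
		return 0;
	memcpy(&oa, (const uint8_t *)rec + sizeof(hdr), sizeof(oa));
	return oa.type;
}

/* Build a record in a local buffer and decode it back. */
static uint32_t demo_decode(uint32_t status)
{
	uint8_t buf[16];
	struct perf_event_header_sketch hdr = { 14, 0, sizeof(buf) };
	struct i915_oa_event_header_sketch oa = { status, 0 };

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), &oa, sizeof(oa));
	return decode_oa_status(buf);
}
```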



* [RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (9 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 10/11] drm/i915: report OA buf overrun + report lost status Robert Bragg
@ 2015-05-07 14:15 ` Robert Bragg
  2015-05-08 16:21 ` [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Peter Zijlstra
  2015-05-08 16:24 ` Peter Zijlstra
  12 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-07 14:15 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

We are still investigating the detailed requirements here, but there
are some constraints we need to apply to unit-level clock gating for
reliable metrics (in particular for a reliable sampling period).

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_oa_perf.c | 70 +++++++++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_reg.h     |  3 ++
 2 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index d0dad5d..2a4121b 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -257,20 +257,46 @@ oa_buffer_destroy(struct drm_i915_private *i915)
 
 static void i915_oa_event_destroy(struct perf_event *event)
 {
-	struct drm_i915_private *i915 =
-		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
 
 	WARN_ON(event->parent);
 
-	oa_buffer_destroy(i915);
+	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) &
+				  ~GEN6_RCZUNIT_CLOCK_GATE_DISABLE));
+	//I915_WRITE(GEN6_UCGCTL3, (I915_READ(GEN6_UCGCTL3) &
+	//			  ~GEN6_OACSUNIT_CLOCK_GATE_DISABLE));
+	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) |
+				    GEN7_DOP_CLOCK_GATE_ENABLE));
+
+	I915_WRITE(GEN7_ROW_CHICKEN2,
+		   _MASKED_BIT_DISABLE(DOP_CLOCK_GATING_DISABLE));
+
+	//if (IS_HSW_GT2(dev_priv->dev)) {
+	if (1) {
+		I915_WRITE(HSW_ROW_CHICKEN2_GT2,
+			   _MASKED_BIT_DISABLE(DOP_CLOCK_GATING_DISABLE));
+	}
+
+	if (IS_HSW_GT3(dev_priv->dev)) {
+		I915_WRITE(HSW_ROW_CHICKEN2_GT3_0,
+			   _MASKED_BIT_DISABLE(DOP_CLOCK_GATING_DISABLE));
+		I915_WRITE(HSW_ROW_CHICKEN2_GT3_1,
+			   _MASKED_BIT_DISABLE(DOP_CLOCK_GATING_DISABLE));
+	}
 
-	i915->oa_pmu.specific_ctx = NULL;
+	I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
+				      ~GT_NOA_ENABLE));
+
+	oa_buffer_destroy(dev_priv);
+
+	dev_priv->oa_pmu.specific_ctx = NULL;
 
-	BUG_ON(i915->oa_pmu.exclusive_event != event);
-	i915->oa_pmu.exclusive_event = NULL;
+	BUG_ON(dev_priv->oa_pmu.exclusive_event != event);
+	dev_priv->oa_pmu.exclusive_event = NULL;
 
-	intel_uncore_forcewake_put(i915, FORCEWAKE_ALL);
-	intel_runtime_pm_put(i915);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	intel_runtime_pm_put(dev_priv);
 }
 
 static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
@@ -581,6 +607,32 @@ static int i915_oa_event_init(struct perf_event *event)
 	BUG_ON(dev_priv->oa_pmu.exclusive_event);
 	dev_priv->oa_pmu.exclusive_event = event;
 
+
+	I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
+
+	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) |
+				  GEN6_RCZUNIT_CLOCK_GATE_DISABLE));
+	//I915_WRITE(GEN6_UCGCTL3, (I915_READ(GEN6_UCGCTL3) |
+	//			  GEN6_OACSUNIT_CLOCK_GATE_DISABLE));
+	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
+				    ~GEN7_DOP_CLOCK_GATE_ENABLE));
+
+	I915_WRITE(GEN7_ROW_CHICKEN2,
+		   _MASKED_BIT_ENABLE(DOP_CLOCK_GATING_DISABLE));
+
+	//if (IS_HSW_GT2(dev_priv->dev)) {
+	if (1) {
+		I915_WRITE(HSW_ROW_CHICKEN2_GT2,
+			   _MASKED_BIT_ENABLE(DOP_CLOCK_GATING_DISABLE));
+	}
+
+	if (IS_HSW_GT3(dev_priv->dev)) {
+		I915_WRITE(HSW_ROW_CHICKEN2_GT3_0,
+			   _MASKED_BIT_ENABLE(DOP_CLOCK_GATING_DISABLE));
+		I915_WRITE(HSW_ROW_CHICKEN2_GT3_1,
+			   _MASKED_BIT_ENABLE(DOP_CLOCK_GATING_DISABLE));
+	}
+
 	event->destroy = i915_oa_event_destroy;
 
 	/* PRM - observability performance counters:
@@ -678,8 +730,6 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
 	WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE,
 		  "disabled OA unit level clock gating will result in incorrect per-context OA counters");
 
-	I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
-
 	if (dev_priv->oa_pmu.metrics_set == I915_OA_METRICS_SET_3D) {
 		config_oa_regs(dev_priv, hsw_profile_3d_mux_config,
 			       ARRAY_SIZE(hsw_profile_3d_mux_config));
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index d94932a..518b34c 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -7036,6 +7036,9 @@ enum skl_disp_power_wells {
 
 #define GEN7_ROW_CHICKEN2		0xe4f4
 #define GEN7_ROW_CHICKEN2_GT2		0xf4f4
+#define HSW_ROW_CHICKEN2_GT2		0xe5f4
+#define HSW_ROW_CHICKEN2_GT3_0		0xe6f4
+#define HSW_ROW_CHICKEN2_GT3_1		0xe7f4
 #define   DOP_CLOCK_GATING_DISABLE	(1<<0)
 
 #define HSW_ROW_CHICKEN3		0xe49c
-- 
2.3.2



* Re: [Intel-gfx] [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl
  2015-05-07 14:15 ` [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl Robert Bragg
@ 2015-05-07 14:20   ` Chris Wilson
  2015-05-18 17:25     ` [RFC PATCH v2] " Robert Bragg
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-05-07 14:20 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Peter Zijlstra, David Airlie, linux-api, dri-devel,
	linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Daniel Vetter

On Thu, May 07, 2015 at 03:15:46PM +0100, Robert Bragg wrote:
> To allow for pmus that may have internal buffering (e.g. the hardware
> itself writes out data to its own circular buffer which is only
> periodically forwarded to userspace via perf) this ioctl enables
> userspace to explicitly ensure it has received all samples before a
> point in time.
> 
> Signed-off-by: Robert Bragg <robert@sixbynine.org>
> ---
>  include/linux/perf_event.h      | 7 +++++++
>  include/uapi/linux/perf_event.h | 1 +
>  kernel/events/core.c            | 5 +++++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1af35b4..69a0cb9 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -266,6 +266,13 @@ struct pmu {
>  	 * flush branch stack on context-switches (needed in cpu-wide mode)
>  	 */
>  	void (*flush_branch_stack)	(void);
> +
> +	/*
> +	 * Flush buffered samples (E.g. for pmu hardware that writes samples to
> +	 * some intermediate buffer) userspace may need to explicitly ensure
> +	 * such samples have been forwarded to perf.
> +	 */
> +	void (*flush)			(struct perf_event *event); /*optional */

void return?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
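
For reference, the optional callback would be invoked guarded on the
core side, along these lines (a sketch under the assumption that pmus
without internal buffering simply leave the hook NULL; names are
illustrative, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

struct perf_event;

/* Minimal stand-in for the new struct pmu member. */
struct pmu_sketch {
	void (*flush)(struct perf_event *event); /* optional */
};

/* Callers guard the optional hook; a pmu with nothing buffered
 * needs no callback at all. */
static int pmu_flush(struct pmu_sketch *pmu, struct perf_event *event)
{
	if (!pmu->flush)
		return 0; /* nothing buffered to flush */
	pmu->flush(event);
	return 0;
}

/* Test double counting invocations. */
static int flush_calls;
static void count_flush(struct perf_event *event)
{
	(void)event;
	flush_calls++;
}
```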


* Re: [Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture
  2015-05-07 14:15 ` [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture Robert Bragg
@ 2015-05-07 14:36   ` Chris Wilson
  2015-05-07 14:58   ` Chris Wilson
  1 sibling, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2015-05-07 14:36 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Peter Zijlstra, David Airlie, linux-api, dri-devel,
	linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Daniel Vetter

On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> +static int init_oa_buffer(struct perf_event *event)
> +{
> +	struct drm_i915_private *dev_priv =
> +		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
> +	struct drm_i915_gem_object *bo;
> +	int ret;
> +
> +	BUG_ON(!IS_HASWELL(dev_priv->dev));
> +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> +	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
> +
> +	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
> +
> +	/* NB: We over allocate the OA buffer due to the way raw sample data
> +	 * gets copied from the gpu mapped circular buffer into the perf
> +	 * circular buffer so that only one copy is required.
> +	 *
> +	 * For each perf sample (raw->size + 4) needs to be 8 byte aligned,
> +	 * where the 4 corresponds to the 32bit raw->size member that's
> +	 * added to the sample header that userspace sees.
> +	 *
> +	 * Due to the + 4 for the size member: when we copy a report to the
> +	 * userspace facing perf buffer we always copy an additional 4 bytes
> + * from the subsequent report to make up for the misalignment, but
> +	 * when a report is at the end of the gpu mapped buffer we need to
> +	 * read 4 bytes past the end of the buffer.
> +	 */
> +	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
> +	if (bo == NULL) {
> +		DRM_ERROR("Failed to allocate OA buffer\n");
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +	dev_priv->oa_pmu.oa_buffer.obj = bo;
> +
> +	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
> +	if (ret)
> +		goto err_unref;
> +
> +	/* PreHSW required 512K alignment, HSW requires 16M */
> +	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
> +	if (ret)
> +		goto err_unref;
> +
> +	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
> +	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);

You can look forward to both i915_gem_object_create_internal() and
i915_gem_object_pin_vmap()

> +
> +	/* Pre-DevBDW: OABUFFER must be set with counters off,
> +	 * before OASTATUS1, but after OASTATUS2 */
> +	I915_WRITE(GEN7_OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> +		   GEN7_OASTATUS2_GGTT); /* head */
> +	I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
> +	I915_WRITE(GEN7_OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
> +		   GEN7_OASTATUS1_OABUFFER_SIZE_16M); /* tail */
> +
> +	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
> +			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
> +			 dev_priv->oa_pmu.oa_buffer.addr);
> +
> +	return 0;
> +
> +err_unref:
> +	drm_gem_object_unreference_unlocked(&bo->base);

But what I really wanted to say was:
mutex deadlock^^^
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture
  2015-05-07 14:15 ` [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture Robert Bragg
  2015-05-07 14:36   ` [Intel-gfx] " Chris Wilson
@ 2015-05-07 14:58   ` Chris Wilson
  2015-05-18 16:36     ` Robert Bragg
  1 sibling, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-05-07 14:58 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Peter Zijlstra, David Airlie, linux-api, dri-devel,
	linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Daniel Vetter

On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> +	/* We bypass the default perf core perf_paranoid_cpu() ||
> +	 * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
> +	 * flag and instead authenticate based on whether the current
> +	 * pid owns the specified context, or require CAP_SYS_ADMIN
> +	 * when collecting cross-context metrics.
> +	 */
> +	dev_priv->oa_pmu.specific_ctx = NULL;
> +	if (oa_attr.single_context) {
> +		u32 ctx_id = oa_attr.ctx_id;
> +		unsigned int drm_fd = oa_attr.drm_fd;
> +		struct fd fd = fdget(drm_fd);
> +
> +		if (fd.file) {

Specify a ctx and not providing the right fd should be its own error,
either EBADF or EINVAL.

> +			dev_priv->oa_pmu.specific_ctx =
> +				lookup_context(dev_priv, fd.file, ctx_id);
> +		}

Missing fdput

> +	}
> +
> +	if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	mutex_lock(&dev_priv->dev->struct_mutex);

i915_mutex_interruptible, probably best to couple into the GPU error
handling here as well especially as init_oa_buffer() will go onto touch
GPU internals.

> +	ret = init_oa_buffer(event);
> +	mutex_unlock(&dev_priv->dev->struct_mutex);
> +
> +	if (ret)
> +		return ret;
> +
> +	BUG_ON(dev_priv->oa_pmu.exclusive_event);
> +	dev_priv->oa_pmu.exclusive_event = event;
> +
> +	event->destroy = i915_oa_event_destroy;
> +
> +	/* PRM - observability performance counters:
> +	 *
> +	 *   OACONTROL, performance counter enable, note:
> +	 *
> +	 *   "When this bit is set, in order to have coherent counts,
> +	 *   RC6 power state and trunk clock gating must be disabled.
> +	 *   This can be achieved by programming MMIO registers as
> +	 *   0xA094=0 and 0xA090[31]=1"
> +	 *
> +	 *   In our case the expectation is that taking pm + FORCEWAKE
> +	 *   references will effectively disable RC6 and trunk clock
> +	 *   gating.
> +	 */
> +	intel_runtime_pm_get(dev_priv);
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);

That is a nuisance. Aside: Why isn't OA inside the powerctx? Is a subset
valid with forcewake? It does perturb the system greatly to disable rc6,
so I wonder if it could be made optional?

> +
> +	return 0;
> +}
> +
> +static void update_oacontrol(struct drm_i915_private *dev_priv)
> +{
> +	BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
> +
> +	if (dev_priv->oa_pmu.event_active) {
> +		unsigned long ctx_id = 0;
> +		bool pinning_ok = false;
> +
> +		if (dev_priv->oa_pmu.specific_ctx) {
> +			struct intel_context *ctx =
> +				dev_priv->oa_pmu.specific_ctx;
> +			struct drm_i915_gem_object *obj =
> +				ctx->legacy_hw_ctx.rcs_state;

If only there was ctx->legacy_hw_ctx.rcs_vma...

> +
> +			if (i915_gem_obj_is_pinned(obj)) {
> +				ctx_id = i915_gem_obj_ggtt_offset(obj);
> +				pinning_ok = true;
> +			}
> +		}
> +
> +		if ((ctx_id == 0 || pinning_ok)) {
> +			bool periodic = dev_priv->oa_pmu.periodic;
> +			u32 period_exponent = dev_priv->oa_pmu.period_exponent;
> +			u32 report_format = dev_priv->oa_pmu.oa_buffer.format;
> +
> +			I915_WRITE(GEN7_OACONTROL,
> +				   (ctx_id & GEN7_OACONTROL_CTX_MASK) |
> +				   (period_exponent <<
> +				    GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
> +				   (periodic ?
> +				    GEN7_OACONTROL_TIMER_ENABLE : 0) |
> +				   (report_format <<
> +				    GEN7_OACONTROL_FORMAT_SHIFT) |
> +				   (ctx_id ?
> +				    GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
> +				   GEN7_OACONTROL_ENABLE);

I notice you don't use any write barriers...
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (10 preceding siblings ...)
  2015-05-07 14:15 ` [RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA Robert Bragg
@ 2015-05-08 16:21 ` Peter Zijlstra
  2015-05-18 17:29   ` Robert Bragg
  2015-05-08 16:24 ` Peter Zijlstra
  12 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-08 16:21 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api


So I've not yet gone through the entire series; but I'm wondering if it's
at all possible to re-use some of this work:

lkml.kernel.org/r/1428453299-19121-1-git-send-email-sukadev@linux.vnet.ibm.com

That's for a Power8 HV call that can basically return an array of
values; which on a superficial level sounds a bit like what this GPU
hardware does.

Let me read more of this..


* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
                   ` (11 preceding siblings ...)
  2015-05-08 16:21 ` [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Peter Zijlstra
@ 2015-05-08 16:24 ` Peter Zijlstra
  2015-05-15  1:07   ` Robert Bragg
  12 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-08 16:24 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:

> I've changed the uapi for configuring the i915_oa specific attributes
> when calling perf_event_open(2) whereby instead of cramming lots of
> bitfields into the perf_event_attr config members, I'm now
> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> config member that's extensible and validated in the same way as the
> perf_event_attr struct. I've found this much nicer to work with while
> being neatly extensible too.

This worries me a bit.. is there more background for this?


* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-08 16:24 ` Peter Zijlstra
@ 2015-05-15  1:07   ` Robert Bragg
  2015-05-19 14:53     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Bragg @ 2015-05-15  1:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
>
>> I've changed the uapi for configuring the i915_oa specific attributes
>> when calling perf_event_open(2) whereby instead of cramming lots of
>> bitfields into the perf_event_attr config members, I'm now
>> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
>> config member that's extensible and validated in the same way as the
>> perf_event_attr struct. I've found this much nicer to work with while
>> being neatly extensible too.
>
> This worries me a bit.. is there more background for this?

Would it maybe be helpful to see the before and after? I had kept this
uapi change in a separate patch for a while locally but in the end
decided to squash it before sending out my updated series.

Although I did find it a bit awkward with the bitfields, I was mainly
concerned about the extensibility of packing logically separate
attributes into the config members and had heard similar concerns from
a few others who had been experimenting with my patches too.

A few simple attributes I can think of a.t.m that we might want to add
in the future are:
- control of the OABUFFER size
- a way to ask the kernel to collect reports at the beginning and end
of batch buffers, in addition to periodic reports
- alternative ways to uniquely identify a context to support tools
profiling a single context not necessarily owned by the current
process

It could also be interesting to expose some counter configuration
through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
EU' counters included in the OA unit reports, each with a 16bit
configuration.

In a more extreme case it might also be useful to allow userspace to
specify a complete counter config, which (depending on the
configuration) could be over 100 32bit values to select the counter
signals + configure the corresponding combining logic.

Since this pmu is in a device driver it also seemed reasonably
appropriate to de-couple it slightly from the core perf_event_attr
structure by allowing driver extensible attributes.

I wonder if it might be less worrisome if the i915_oa_copy_attr() code
were instead a re-usable utility perhaps maintained in events/core.c,
so if other pmu drivers were to follow suit there would be less risk
of a mistake being made here?

Regards,
- Robert


* Re: [Intel-gfx] [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture
  2015-05-07 14:58   ` Chris Wilson
@ 2015-05-18 16:36     ` Robert Bragg
  2015-05-18 17:17       ` [RFC PATCH v2] " Robert Bragg
  2015-05-18 17:21       ` [RFC PATCH] squash: be more careful stopping oacontrol updates Robert Bragg
  0 siblings, 2 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-18 16:36 UTC (permalink / raw)
  To: dri-devel, David Airlie, Daniel Vetter, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Chris Wilson,
	Arnaldo Carvalho de Melo, intel-gfx, linux-api, linux-kernel

On 7 May 2015 15:58, "Chris Wilson" <chris@chris-wilson.co.uk> wrote:
>
> On Thu, May 07, 2015 at 03:15:50PM +0100, Robert Bragg wrote:
> > +     /* We bypass the default perf core perf_paranoid_cpu() ||
> > +      * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
> > +      * flag and instead authenticate based on whether the current
> > +      * pid owns the specified context, or require CAP_SYS_ADMIN
> > +      * when collecting cross-context metrics.
> > +      */
> > +     dev_priv->oa_pmu.specific_ctx = NULL;
> > +     if (oa_attr.single_context) {
> > +             u32 ctx_id = oa_attr.ctx_id;
> > +             unsigned int drm_fd = oa_attr.drm_fd;
> > +             struct fd fd = fdget(drm_fd);
> > +
> > +             if (fd.file) {
>
> Specify a ctx and not providing the right fd should be its own error,
> either EBADF or EINVAL.

Right, I went for both in the end; EBADF if fdget fails and EINVAL if
the fd is ok but we fail to lookup a context with it.

>
> > +                     dev_priv->oa_pmu.specific_ctx =
> > +                             lookup_context(dev_priv, fd.file, ctx_id);
> > +             }
>
> Missing fdput

Ah yes; fixed.

>
> > +     }
> > +
> > +     if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
> > +             return -EACCES;
> > +
> > +     mutex_lock(&dev_priv->dev->struct_mutex);
>
> i915_mutex_interruptible, probably best to couple into the GPU error
> handling here as well especially as init_oa_buffer() will go onto touch
> GPU internals.

Ok, using i915_mutex_interruptible makes sense, I've also moved the
locking into init_oa_buffer.

About the GPU error handling, do you have any thoughts on what could
be most helpful here? I'm thinking a.t.m of extending
i915_capture_reg_state() in i915_gpu_error.c to capture the OACONTROL
+ OASTATUS state and perhaps all the UCGCTL unit clock gating state
too.

>
> > +     ret = init_oa_buffer(event);
> > +     mutex_unlock(&dev_priv->dev->struct_mutex);
> > +
> > +     if (ret)
> > +             return ret;
> > +
> > +     BUG_ON(dev_priv->oa_pmu.exclusive_event);
> > +     dev_priv->oa_pmu.exclusive_event = event;
> > +
> > +     event->destroy = i915_oa_event_destroy;
> > +
> > +     /* PRM - observability performance counters:
> > +      *
> > +      *   OACONTROL, performance counter enable, note:
> > +      *
> > +      *   "When this bit is set, in order to have coherent counts,
> > +      *   RC6 power state and trunk clock gating must be disabled.
> > +      *   This can be achieved by programming MMIO registers as
> > +      *   0xA094=0 and 0xA090[31]=1"
> > +      *
> > +      *   In our case the expectation is that taking pm + FORCEWAKE
> > +      *   references will effectively disable RC6 and trunk clock
> > +      *   gating.
> > +      */
> > +     intel_runtime_pm_get(dev_priv);
> > +     intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> That is a nuisance. Aside: Why isn't OA inside the powerctx? Is a subset
> valid with forcewake? It does perturb the system greatly to disable rc6,
> so I wonder if it could be made optional?

Yes, it's a shame.

I probably only really know enough about the OA unit design to be
dangerous and won't try and comment in detail here, but I think
there's more to it than not saving state in a power context. As I
understand it, there were a number of design changes made to enable
OA+RC6 support for BDW+, including having the OA unit automatically
write out reports to the OA buffer when entering RC6.

I think just FORCEWAKE_RENDER would work here, but only say that
because it looks like HSW only has the render forcewake domain from
what I could tell.

I think I need to update the comment above these lines as I don't
think these will affect crclk gating; these just handle disabling RC6.

The WIP patch I sent out basically represents me trying to get to the
bottom of the clock gating constraints we have.

At this point I think I need to disable CS unit gating via UCGCTL1, as
well as DOP gating for the render trunk clock via MISCCPCTL but I'm
not entirely confident about that just yet. At least empirically I see
these fixing some issues in rudimentary testing.

>
> > +
> > +     return 0;
> > +}
> > +
> > +static void update_oacontrol(struct drm_i915_private *dev_priv)
> > +{
> > +     BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
> > +
> > +     if (dev_priv->oa_pmu.event_active) {
> > +             unsigned long ctx_id = 0;
> > +             bool pinning_ok = false;
> > +
> > +             if (dev_priv->oa_pmu.specific_ctx) {
> > +                     struct intel_context *ctx =
> > +                             dev_priv->oa_pmu.specific_ctx;
> > +                     struct drm_i915_gem_object *obj =
> > +                             ctx->legacy_hw_ctx.rcs_state;
>
> If only there was ctx->legacy_hw_ctx.rcs_vma...

ok; not sure if this is a prod to add that, a heads up that it's
coming, or seething because some prior attempt to add it was nack'd.


>
> > +
> > +                     if (i915_gem_obj_is_pinned(obj)) {
> > +                             ctx_id = i915_gem_obj_ggtt_offset(obj);
> > +                             pinning_ok = true;
> > +                     }
> > +             }
> > +
> > +             if ((ctx_id == 0 || pinning_ok)) {
> > +                     bool periodic = dev_priv->oa_pmu.periodic;
> > +                     u32 period_exponent = dev_priv->oa_pmu.period_exponent;
> > +                     u32 report_format = dev_priv->oa_pmu.oa_buffer.format;
> > +
> > +                     I915_WRITE(GEN7_OACONTROL,
> > +                                (ctx_id & GEN7_OACONTROL_CTX_MASK) |
> > +                                (period_exponent <<
> > +                                 GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
> > +                                (periodic ?
> > +                                 GEN7_OACONTROL_TIMER_ENABLE : 0) |
> > +                                (report_format <<
> > +                                 GEN7_OACONTROL_FORMAT_SHIFT) |
> > +                                (ctx_id ?
> > +                                 GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
> > +                                GEN7_OACONTROL_ENABLE);
>
> I notice you don't use any write barriers...

ok, so I still haven't put write barriers within update_oacontrol()
itself, but I've now added mmiowb()s just before unlocking, which is
done outside of update_oacontrol(). I think a barrier just within
update_oacontrol() could be ok a.t.m while the pinning hooks only
use update_oacontrol(), but in case we introduce more overlapping
mmio configuration within these critical sections, waiting until the
unlock might be preferable. On the other hand, a.t.m the pinning
callbacks now have redundant wb()s while there is no specific
context filtering - not sure if that should be a concern.

Looking at this, I also didn't feel happy with the way I reset
oa_pmu->specific_context when destroying an event, considering that
->specific_context being set is what determines whether the pinning
callbacks may call update_oacontrol() asynchronously with respect to
the pmu methods. Although we know oacontrol will be disabled by the
time we come to destroy an event, it didn't seem great that we
could be continuing to run update_oacontrol() up to the point where we
are resetting all the clock gating, power management and NOA enable
state. I'll attach a separate patch to see if you agree this is worth
changing.

Thanks for the comments.

- Robert

> -Chris
> --
> Chris Wilson, Intel Open Source Technology Centre


* [RFC PATCH v2] drm/i915: Expose PMU for Observation Architecture
  2015-05-18 16:36     ` Robert Bragg
@ 2015-05-18 17:17       ` Robert Bragg
  2015-05-18 17:21       ` [RFC PATCH] squash: be more careful stopping oacontrol updates Robert Bragg
  1 sibling, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-18 17:17 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api, Chris Wilson

Gen graphics hardware can be set up to periodically write snapshots of
performance counters into a circular buffer and this patch exposes that
capability to userspace via the perf interface.

To start with this only enables the A (aggregating) counters with the
simplest configuration requirements.

Only Haswell is supported currently.

v2:
- fix deadlock in init_oa_buffer error path
- EBADF for bad drm fd, EINVAL for failure to lookup ctx
- mmio write barriers, after OA reconfigure, before unlocks
- use i915_mutex_lock_interruptible within event init

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/Makefile           |   1 +
 drivers/gpu/drm/i915/i915_dma.c         |   6 +
 drivers/gpu/drm/i915/i915_drv.h         |  53 +++
 drivers/gpu/drm/i915/i915_gem_context.c |  45 +-
 drivers/gpu/drm/i915/i915_oa_perf.c     | 762 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  68 +++
 include/uapi/drm/i915_drm.h             |  29 ++
 7 files changed, 954 insertions(+), 10 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index b7ddf48..b5ebfbe 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -15,6 +15,7 @@ i915-y := i915_drv.o \
 
 i915-$(CONFIG_COMPAT)   += i915_ioc32.o
 i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o
+i915-$(CONFIG_PERF_EVENTS) += i915_oa_perf.o
 
 # GEM code
 i915-y += i915_cmd_parser.o \
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index a238889..c299e18 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -818,6 +818,11 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	mutex_init(&dev_priv->modeset_restore_lock);
 	mutex_init(&dev_priv->csr_lock);
 
+	/* Must at least be registered before trying to pin any context
+	 * otherwise i915_oa_context_pin_notify() will lock an un-initialized
+	 * spinlock, upsetting lockdep checks */
+	i915_oa_pmu_register(dev);
+
 	intel_pm_setup(dev);
 
 	intel_display_crc_init(dev);
@@ -1067,6 +1072,7 @@ int i915_driver_unload(struct drm_device *dev)
 		return ret;
 	}
 
+	i915_oa_pmu_unregister(dev);
 	intel_power_domains_fini(dev_priv);
 
 	intel_gpu_ips_teardown();
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6a66d6b..dd475ca 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -49,6 +49,7 @@
 #include <linux/hashtable.h>
 #include <linux/intel-iommu.h>
 #include <linux/kref.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 
 /* General customization:
@@ -1839,6 +1840,35 @@ struct drm_i915_private {
 	 */
 	struct workqueue_struct *dp_wq;
 
+#ifdef CONFIG_PERF_EVENTS
+	struct {
+		struct pmu pmu;
+		spinlock_t lock;
+		struct hrtimer timer;
+		struct pt_regs dummy_regs;
+
+		struct perf_event *exclusive_event;
+		struct intel_context *specific_ctx;
+		bool event_active;
+
+		bool periodic;
+		u32 period_exponent;
+
+		u32 metrics_set;
+
+		struct {
+			struct drm_i915_gem_object *obj;
+			u32 gtt_offset;
+			u8 *addr;
+			u32 head;
+			u32 tail;
+			int format;
+			int format_size;
+			spinlock_t flush_lock;
+		} oa_buffer;
+	} oa_pmu;
+#endif
+
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
 	struct {
 		int (*execbuf_submit)(struct drm_device *dev, struct drm_file *file,
@@ -3012,6 +3042,20 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file_priv);
 
+#ifdef CONFIG_PERF_EVENTS
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context);
+void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+				  struct intel_context *context);
+#else
+static inline void
+i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+			   struct intel_context *context) {}
+static inline void
+i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+			     struct intel_context *context) {}
+#endif
+
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
 					  struct i915_address_space *vm,
@@ -3121,6 +3165,15 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
 		    u32 batch_len,
 		    bool is_master);
 
+/* i915_oa_perf.c */
+#ifdef CONFIG_PERF_EVENTS
+extern void i915_oa_pmu_register(struct drm_device *dev);
+extern void i915_oa_pmu_unregister(struct drm_device *dev);
+#else
+static inline void i915_oa_pmu_register(struct drm_device *dev) {}
+static inline void i915_oa_pmu_unregister(struct drm_device *dev) {}
+#endif
+
 /* i915_suspend.c */
 extern int i915_save_state(struct drm_device *dev);
 extern int i915_restore_state(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 5a47eb5..3e9a7f5 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -133,6 +133,33 @@ static int get_context_size(struct drm_device *dev)
 	return ret;
 }
 
+static int i915_gem_context_pin_state(struct drm_device *dev,
+				      struct intel_context *ctx)
+{
+	int ret;
+
+	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
+
+	ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
+				    get_context_alignment(dev), 0);
+	if (ret)
+		return ret;
+
+	i915_oa_context_pin_notify(dev->dev_private, ctx);
+
+	return 0;
+}
+
+static void i915_gem_context_unpin_state(struct drm_device *dev,
+					 struct intel_context *ctx)
+{
+	/* Ensure that we stop the OA unit referencing the context *before*
+	 * actually unpinning the ctx */
+	i915_oa_context_unpin_notify(dev->dev_private, ctx);
+
+	i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
+}
+
 void i915_gem_context_free(struct kref *ctx_ref)
 {
 	struct intel_context *ctx = container_of(ctx_ref,
@@ -260,8 +287,7 @@ i915_gem_create_context(struct drm_device *dev,
 		 * be available. To avoid this we always pin the default
 		 * context.
 		 */
-		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(dev), 0);
+		ret = i915_gem_context_pin_state(dev, ctx);
 		if (ret) {
 			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
 			goto err_destroy;
@@ -287,7 +313,7 @@ i915_gem_create_context(struct drm_device *dev,
 
 err_unpin:
 	if (is_global_default_ctx && ctx->legacy_hw_ctx.rcs_state)
-		i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(dev, ctx);
 err_destroy:
 	i915_gem_context_unreference(ctx);
 	return ERR_PTR(ret);
@@ -314,7 +340,7 @@ void i915_gem_context_reset(struct drm_device *dev)
 
 		if (lctx) {
 			if (lctx->legacy_hw_ctx.rcs_state && i == RCS)
-				i915_gem_object_ggtt_unpin(lctx->legacy_hw_ctx.rcs_state);
+				i915_gem_context_unpin_state(dev, lctx);
 
 			i915_gem_context_unreference(lctx);
 			ring->last_context = NULL;
@@ -388,12 +414,12 @@ void i915_gem_context_fini(struct drm_device *dev)
 		if (dev_priv->ring[RCS].last_context == dctx) {
 			/* Fake switch to NULL context */
 			WARN_ON(dctx->legacy_hw_ctx.rcs_state->active);
-			i915_gem_object_ggtt_unpin(dctx->legacy_hw_ctx.rcs_state);
+			i915_gem_context_unpin_state(dev, dctx);
 			i915_gem_context_unreference(dctx);
 			dev_priv->ring[RCS].last_context = NULL;
 		}
 
-		i915_gem_object_ggtt_unpin(dctx->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(dev, dctx);
 	}
 
 	for (i = 0; i < I915_NUM_RINGS; i++) {
@@ -642,8 +668,7 @@ static int do_switch(struct intel_engine_cs *ring,
 
 	/* Trying to pin first makes error handling easier. */
 	if (ring == &dev_priv->ring[RCS]) {
-		ret = i915_gem_obj_ggtt_pin(to->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(ring->dev), 0);
+		ret = i915_gem_context_pin_state(ring->dev, to);
 		if (ret)
 			return ret;
 	}
@@ -757,7 +782,7 @@ static int do_switch(struct intel_engine_cs *ring,
 			from->legacy_hw_ctx.rcs_state->last_read_req) != ring);
 
 		/* obj is kept alive until the next request by its active ref */
-		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(ring->dev, from);
 		i915_gem_context_unreference(from);
 	}
 
@@ -780,7 +805,7 @@ done:
 
 unpin_out:
 	if (ring->id == RCS)
-		i915_gem_object_ggtt_unpin(to->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unpin_state(ring->dev, to);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
new file mode 100644
index 0000000..bf1c1d6
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -0,0 +1,762 @@
+#include <linux/perf_event.h>
+#include <linux/sizes.h>
+
+#include "i915_drv.h"
+#include "intel_ringbuffer.h"
+
+/* Must be a power of two */
+#define OA_BUFFER_SIZE	     SZ_16M
+#define OA_TAKEN(tail, head) ((tail - head) & (OA_BUFFER_SIZE - 1))
+
+#define FREQUENCY 200
+#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
+
+static int hsw_perf_format_sizes[] = {
+	64,  /* A13_HSW */
+	128, /* A29_HSW */
+	128, /* A13_B8_C8_HSW */
+	-1,  /* Disallowed since 192 bytes doesn't factor into buffer size
+		(A29_B8_C8_HSW) */
+	64,  /* B4_C8_HSW */
+	256, /* A45_B8_C8_HSW */
+	128, /* B4_C8_A16_HSW */
+	64   /* C4_B8_HSW */
+};
+
+static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
+					     u8 *snapshot,
+					     struct perf_event *event)
+{
+	struct perf_sample_data data;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	struct perf_raw_record raw;
+
+	WARN_ON(snapshot_size == 0);
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	/* Note: the combined u32 raw->size member + raw data itself must be 8
+	 * byte aligned. (See note in init_oa_buffer for more details) */
+	raw.size = snapshot_size + 4;
+	raw.data = snapshot;
+
+	data.raw = &raw;
+
+	perf_event_overflow(event, &data, &dev_priv->oa_pmu.dummy_regs);
+}
+
+static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv,
+				u32 head,
+				u32 tail)
+{
+	struct perf_event *exclusive_event = dev_priv->oa_pmu.exclusive_event;
+	int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size;
+	u8 *oa_buf_base = dev_priv->oa_pmu.oa_buffer.addr;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	u8 *snapshot;
+	u32 taken;
+
+	head -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+	tail -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
+
+	/* Note: the gpu doesn't wrap the tail according to the OA buffer size
+	 * so when we need to make sure our head/tail values are in-bounds we
+	 * use the above mask.
+	 */
+
+	while ((taken = OA_TAKEN(tail, head))) {
+		/* The tail increases in 64 byte increments, not in
+		 * format_size steps. */
+		if (taken < snapshot_size)
+			break;
+
+		snapshot = oa_buf_base + (head & mask);
+		head += snapshot_size;
+
+		/* We currently only allow exclusive access to the counters
+	 * so only have one event to forward to... */
+		if (dev_priv->oa_pmu.event_active)
+			forward_one_oa_snapshot_to_event(dev_priv, snapshot,
+							 exclusive_event);
+	}
+
+	return dev_priv->oa_pmu.oa_buffer.gtt_offset + head;
+}
+
+static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
+			       bool skip_if_flushing)
+{
+	unsigned long flags;
+	u32 oastatus2;
+	u32 oastatus1;
+	u32 head;
+	u32 tail;
+
+	/* Can either flush via hrtimer callback or pmu methods/fops */
+	if (skip_if_flushing) {
+
+		/* If the hrtimer triggers at the same time that we are
+		 * responding to a userspace initiated flush then we can
+		 * just bail out...
+		 */
+		if (!spin_trylock_irqsave(&dev_priv->oa_pmu.oa_buffer.flush_lock,
+					  flags))
+			return;
+	} else
+		spin_lock_irqsave(&dev_priv->oa_pmu.oa_buffer.flush_lock, flags);
+
+	WARN_ON(!dev_priv->oa_pmu.oa_buffer.addr);
+
+	oastatus2 = I915_READ(GEN7_OASTATUS2);
+	oastatus1 = I915_READ(GEN7_OASTATUS1);
+
+	head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
+	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+
+	if (oastatus1 & (GEN7_OASTATUS1_OABUFFER_OVERFLOW |
+			 GEN7_OASTATUS1_REPORT_LOST)) {
+
+		/* XXX: How can we convey report-lost errors to userspace?  It
+		 * doesn't look like perf's _REPORT_LOST mechanism is
+		 * appropriate in this case; that's just for cases where we
+		 * run out of space for samples in the perf circular buffer.
+		 *
+		 * Maybe we can claim a special report-id and use that to
+		 * forward status flags?
+		 */
+		pr_debug("OA buffer read error: addr = %p, head = %u, offset = %u, tail = %u cnt o'flow = %d, buf o'flow = %d, rpt lost = %d\n",
+			 dev_priv->oa_pmu.oa_buffer.addr,
+			 head,
+			 head - dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 tail,
+			 oastatus1 & GEN7_OASTATUS1_COUNTER_OVERFLOW ? 1 : 0,
+			 oastatus1 & GEN7_OASTATUS1_OABUFFER_OVERFLOW ? 1 : 0,
+			 oastatus1 & GEN7_OASTATUS1_REPORT_LOST ? 1 : 0);
+
+		I915_WRITE(GEN7_OASTATUS1, oastatus1 &
+			   ~(GEN7_OASTATUS1_OABUFFER_OVERFLOW |
+			     GEN7_OASTATUS1_REPORT_LOST));
+	}
+
+	head = forward_oa_snapshots(dev_priv, head, tail);
+
+	I915_WRITE(GEN7_OASTATUS2, (head & GEN7_OASTATUS2_HEAD_MASK) |
+				    GEN7_OASTATUS2_GGTT);
+
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.oa_buffer.flush_lock, flags);
+}
+
+static void
+oa_buffer_destroy(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->oa_pmu.oa_buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_buffer.obj);
+	drm_gem_object_unreference(&i915->oa_pmu.oa_buffer.obj->base);
+
+	i915->oa_pmu.oa_buffer.obj = NULL;
+	i915->oa_pmu.oa_buffer.gtt_offset = 0;
+	i915->oa_pmu.oa_buffer.addr = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
+}
+
+static void i915_oa_event_destroy(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	WARN_ON(event->parent);
+
+	oa_buffer_destroy(i915);
+
+	i915->oa_pmu.specific_ctx = NULL;
+
+	BUG_ON(i915->oa_pmu.exclusive_event != event);
+	i915->oa_pmu.exclusive_event = NULL;
+
+	intel_uncore_forcewake_put(i915, FORCEWAKE_ALL);
+	intel_runtime_pm_put(i915);
+}
+
+static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
+{
+	int i;
+	void *addr = NULL;
+	struct sg_page_iter sg_iter;
+	struct page **pages;
+
+	pages = drm_malloc_ab(obj->base.size >> PAGE_SHIFT, sizeof(*pages));
+	if (pages == NULL) {
+		DRM_DEBUG_DRIVER("Failed to get space for pages\n");
+		goto finish;
+	}
+
+	i = 0;
+	for_each_sg_page(obj->pages->sgl, &sg_iter, obj->pages->nents, 0) {
+		pages[i] = sg_page_iter_page(&sg_iter);
+		i++;
+	}
+
+	addr = vmap(pages, i, 0, PAGE_KERNEL);
+	if (addr == NULL) {
+		DRM_DEBUG_DRIVER("Failed to vmap pages\n");
+		goto finish;
+	}
+
+finish:
+	if (pages)
+		drm_free_large(pages);
+	return addr;
+}
+
+static int init_oa_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev));
+	BUG_ON(mutex_is_locked(&dev_priv->dev->struct_mutex));
+	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return ret;
+
+	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
+
+	/* NB: We over-allocate the OA buffer due to the way raw sample data
+	 * gets copied from the gpu mapped circular buffer into the perf
+	 * circular buffer so that only one copy is required.
+	 *
+	 * For each perf sample (raw->size + 4) needs to be 8 byte aligned,
+	 * where the 4 corresponds to the 32bit raw->size member that's
+	 * added to the sample header that userspace sees.
+	 *
+	 * Due to the + 4 for the size member: when we copy a report to the
+	 * userspace facing perf buffer we always copy an additional 4 bytes
+	 * from the subsequent report to make up for the misalignment, but
+	 * when a report is at the end of the gpu mapped buffer we need to
+	 * read 4 bytes past the end of the buffer.
+	 */
+	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
+	if (bo == NULL) {
+		DRM_ERROR("Failed to allocate OA buffer\n");
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	dev_priv->oa_pmu.oa_buffer.obj = bo;
+
+	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
+	if (ret)
+		goto err_unref;
+
+	/* PreHSW required 512K alignment, HSW requires 16M */
+	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
+	if (ret)
+		goto err_unref;
+
+	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
+	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
+
+	/* Pre-DevBDW: OABUFFER must be set with counters off,
+	 * before OASTATUS1, but after OASTATUS2 */
+	I915_WRITE(GEN7_OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   GEN7_OASTATUS2_GGTT); /* head */
+	I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset);
+	I915_WRITE(GEN7_OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset |
+		   GEN7_OASTATUS1_OABUFFER_SIZE_16M); /* tail */
+
+	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
+			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
+			 dev_priv->oa_pmu.oa_buffer.addr);
+
+	goto unlock;
+
+err_unref:
+	drm_gem_object_unreference(&bo->base);
+
+unlock:
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	return ret;
+}
+
+static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, typeof(*i915), oa_pmu.timer);
+
+	flush_oa_snapshots(i915, true);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
+static struct intel_context *
+lookup_context(struct drm_i915_private *dev_priv,
+	       struct file *user_filp,
+	       u32 ctx_user_handle)
+{
+	struct intel_context *ctx;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+		struct drm_file *drm_file;
+
+		if (!ctx->file_priv)
+			continue;
+
+		drm_file = ctx->file_priv->file;
+
+		if (user_filp->private_data == drm_file &&
+		    ctx->user_handle == ctx_user_handle) {
+			mutex_unlock(&dev_priv->dev->struct_mutex);
+			return ctx;
+		}
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	return NULL;
+}
+
+static int i915_oa_copy_attr(drm_i915_oa_attr_t __user *uattr,
+			     drm_i915_oa_attr_t *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, I915_OA_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (size < I915_OA_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we don't know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+	if (attr->__reserved_1)
+		return -EINVAL;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
+static int i915_oa_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	drm_i915_oa_attr_t oa_attr;
+	u64 report_format;
+	int ret = 0;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	ret = i915_oa_copy_attr(to_user_ptr(event->attr.config), &oa_attr);
+	if (ret)
+		return ret;
+
+	/* To avoid the complexity of having to accurately filter
+	 * counter snapshots and marshal to the appropriate client
+	 * we currently only allow exclusive access */
+	if (dev_priv->oa_pmu.oa_buffer.obj)
+		return -EBUSY;
+
+	report_format = oa_attr.format;
+	dev_priv->oa_pmu.oa_buffer.format = report_format;
+	dev_priv->oa_pmu.metrics_set = oa_attr.metrics_set;
+
+	if (IS_HASWELL(dev_priv->dev)) {
+		int snapshot_size;
+
+		if (report_format >= ARRAY_SIZE(hsw_perf_format_sizes))
+			return -EINVAL;
+
+		snapshot_size = hsw_perf_format_sizes[report_format];
+		if (snapshot_size < 0)
+			return -EINVAL;
+
+		dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
+	} else {
+		BUG(); /* pmu shouldn't have been registered */
+		return -ENODEV;
+	}
+
+	/* Since we are limited to an exponential scale for
+	 * programming the OA sampling period we don't allow userspace
+	 * to pass a precise attr.sample_period. */
+	if (event->attr.freq ||
+	    (event->attr.sample_period != 0 &&
+	     event->attr.sample_period != 1))
+		return -EINVAL;
+
+	dev_priv->oa_pmu.periodic = event->attr.sample_period;
+
+	/* Instead of allowing userspace to configure the period via
+	 * attr.sample_period we instead accept an exponent whereby
+	 * the sample_period will be:
+	 *
+	 *   80ns * 2^(period_exponent + 1)
+	 *
+	 * Programming a period of 160 nanoseconds would not be very
+	 * polite, so higher frequencies are reserved for root.
+	 */
+	if (dev_priv->oa_pmu.periodic) {
+		u64 period_exponent = oa_attr.timer_exponent;
+
+		if (period_exponent > 63)
+			return -EINVAL;
+
+		if (period_exponent < 15 && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		dev_priv->oa_pmu.period_exponent = period_exponent;
+	} else if (oa_attr.timer_exponent)
+		return -EINVAL;
+
+	/* We bypass the default perf core perf_paranoid_cpu() ||
+	 * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE
+	 * flag and instead authenticate based on whether the current
+	 * pid owns the specified context, or require CAP_SYS_ADMIN
+	 * when collecting cross-context metrics.
+	 */
+	dev_priv->oa_pmu.specific_ctx = NULL;
+	if (oa_attr.single_context) {
+		u32 ctx_id = oa_attr.ctx_id;
+		unsigned int drm_fd = oa_attr.drm_fd;
+		struct fd fd = fdget(drm_fd);
+
+		if (!fd.file)
+			return -EBADF;
+
+		dev_priv->oa_pmu.specific_ctx =
+			lookup_context(dev_priv, fd.file, ctx_id);
+		fdput(fd);
+
+		if (!dev_priv->oa_pmu.specific_ctx)
+			return -EINVAL;
+	}
+
+	if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	ret = init_oa_buffer(event);
+	if (ret)
+		return ret;
+
+	BUG_ON(dev_priv->oa_pmu.exclusive_event);
+	dev_priv->oa_pmu.exclusive_event = event;
+
+	event->destroy = i915_oa_event_destroy;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, performance counter enable, note:
+	 *
+	 *   "When this bit is set, in order to have coherent counts,
+	 *   RC6 power state and trunk clock gating must be disabled.
+	 *   This can be achieved by programming MMIO registers as
+	 *   0xA094=0 and 0xA090[31]=1"
+	 *
+	 *   In our case we expect that taking pm + FORCEWAKE
+	 *   references will effectively disable RC6 and trunk clock
+	 *   gating.
+	 */
+	intel_runtime_pm_get(dev_priv);
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	return 0;
+}
+
+static void update_oacontrol(struct drm_i915_private *dev_priv)
+{
+	BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
+
+	if (dev_priv->oa_pmu.event_active) {
+		unsigned long ctx_id = 0;
+		bool pinning_ok = false;
+
+		if (dev_priv->oa_pmu.specific_ctx) {
+			struct intel_context *ctx =
+				dev_priv->oa_pmu.specific_ctx;
+			struct drm_i915_gem_object *obj =
+				ctx->legacy_hw_ctx.rcs_state;
+
+			if (i915_gem_obj_is_pinned(obj)) {
+				ctx_id = i915_gem_obj_ggtt_offset(obj);
+				pinning_ok = true;
+			}
+		}
+
+		if ((ctx_id == 0 || pinning_ok)) {
+			bool periodic = dev_priv->oa_pmu.periodic;
+			u32 period_exponent = dev_priv->oa_pmu.period_exponent;
+			u32 report_format = dev_priv->oa_pmu.oa_buffer.format;
+
+			I915_WRITE(GEN7_OACONTROL,
+				   (ctx_id & GEN7_OACONTROL_CTX_MASK) |
+				   (period_exponent <<
+				    GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
+				   (periodic ?
+				    GEN7_OACONTROL_TIMER_ENABLE : 0) |
+				   (report_format <<
+				    GEN7_OACONTROL_FORMAT_SHIFT) |
+				   (ctx_id ?
+				    GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
+				   GEN7_OACONTROL_ENABLE);
+			return;
+		}
+	}
+
+	I915_WRITE(GEN7_OACONTROL, 0);
+}
+
+static void i915_oa_event_start(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	unsigned long lock_flags;
+	u32 oastatus1, tail;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, specific context enable:
+	 *
+	 *   "OA unit level clock gating must be ENABLED when using
+	 *   specific ContextID feature."
+	 *
+	 * Assuming we don't ever disable OA unit level clock gating
+	 * let's just assert that this condition is met...
+	 */
+	WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE,
+		  "disabled OA unit level clock gating will result in incorrect per-context OA counters");
+
+	/* XXX: On Haswell, when threshold disable mode is desired,
+	 * instead of setting the threshold enable to '0', we need to
+	 * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0
+	 * (threshold value of 0)
+	 */
+	I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE |
+				  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG5, 0); /* threshold value */
+
+	I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE |
+				  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE));
+	I915_WRITE(OASTARTTRIG1, 0); /* threshold value */
+
+	/* Setup B0 as the gpu clock counter... */
+	I915_WRITE(OACEC0_0, OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL); /* to 0 */
+	I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
+
+	dev_priv->oa_pmu.event_active = true;
+	update_oacontrol(dev_priv);
+
+	/* Reset the head ptr to ensure we don't forward reports relating
+	 * to a previous perf event */
+	oastatus1 = I915_READ(GEN7_OASTATUS1);
+	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+	I915_WRITE(GEN7_OASTATUS2, (tail & GEN7_OASTATUS2_HEAD_MASK) |
+				    GEN7_OASTATUS2_GGTT);
+
+	mmiowb();
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+
+	if (event->attr.sample_period)
+		__hrtimer_start_range_ns(&dev_priv->oa_pmu.timer,
+					 ns_to_ktime(PERIOD), 0,
+					 HRTIMER_MODE_REL_PINNED, 0);
+
+	event->hw.state = 0;
+}
+
+static void i915_oa_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	unsigned long lock_flags;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
+
+	dev_priv->oa_pmu.event_active = false;
+	update_oacontrol(dev_priv);
+
+	mmiowb();
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+
+	if (event->attr.sample_period) {
+		hrtimer_cancel(&dev_priv->oa_pmu.timer);
+		flush_oa_snapshots(dev_priv, false);
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_oa_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_oa_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_oa_event_del(struct perf_event *event, int flags)
+{
+	i915_oa_event_stop(event, flags);
+}
+
+static void i915_oa_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static int i915_oa_event_flush(struct perf_event *event)
+{
+	if (event->attr.sample_period) {
+		struct drm_i915_private *i915 =
+			container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+		flush_oa_snapshots(i915, true);
+	}
+
+	return 0;
+}
+
+static int i915_oa_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context)
+{
+	unsigned long flags;
+
+	if (dev_priv->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags);
+
+	if (dev_priv->oa_pmu.specific_ctx == context)
+		update_oacontrol(dev_priv);
+
+	mmiowb();
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
+}
+
+void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
+				  struct intel_context *context)
+{
+	unsigned long flags;
+
+	if (dev_priv->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags);
+
+	if (dev_priv->oa_pmu.specific_ctx == context)
+		update_oacontrol(dev_priv);
+
+	mmiowb();
+	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags);
+}
+
+void i915_oa_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (!IS_HASWELL(dev))
+		return;
+
+	/* We need to be careful about forwarding cpu metrics to
+	 * userspace considering that PERF_PMU_CAP_IS_DEVICE bypasses
+	 * the events/core security check that stops an unprivileged
+	 * process collecting metrics for other processes.
+	 */
+	i915->oa_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->oa_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->oa_pmu.timer.function = hrtimer_sample;
+
+	spin_lock_init(&i915->oa_pmu.lock);
+
+	i915->oa_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
+
+	/* Effectively disallow opening an event with a specific pid
+	 * since we aren't interested in processes running on the cpu...
+	 */
+	i915->oa_pmu.pmu.task_ctx_nr   = perf_invalid_context;
+
+	i915->oa_pmu.pmu.event_init    = i915_oa_event_init;
+	i915->oa_pmu.pmu.add	       = i915_oa_event_add;
+	i915->oa_pmu.pmu.del	       = i915_oa_event_del;
+	i915->oa_pmu.pmu.start	       = i915_oa_event_start;
+	i915->oa_pmu.pmu.stop	       = i915_oa_event_stop;
+	i915->oa_pmu.pmu.read	       = i915_oa_event_read;
+	i915->oa_pmu.pmu.flush	       = i915_oa_event_flush;
+	i915->oa_pmu.pmu.event_idx     = i915_oa_event_event_idx;
+
+	if (perf_pmu_register(&i915->oa_pmu.pmu, "i915_oa", -1))
+		i915->oa_pmu.pmu.event_init = NULL;
+}
+
+void i915_oa_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->oa_pmu.pmu);
+	i915->oa_pmu.pmu.event_init = NULL;
+}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index dc6907b..40fc44f 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -516,6 +516,73 @@
 #define GEN7_3DPRIM_BASE_VERTEX         0x2440
 
 #define GEN7_OACONTROL 0x2360
+#define  GEN7_OACONTROL_CTX_MASK	    0xFFFFF000
+#define  GEN7_OACONTROL_TIMER_PERIOD_MASK   0x3F
+#define  GEN7_OACONTROL_TIMER_PERIOD_SHIFT  6
+#define  GEN7_OACONTROL_TIMER_ENABLE	    (1<<5)
+#define  GEN7_OACONTROL_FORMAT_A13	    (0<<2)
+#define  GEN7_OACONTROL_FORMAT_A29	    (1<<2)
+#define  GEN7_OACONTROL_FORMAT_A13_B8_C8    (2<<2)
+#define  GEN7_OACONTROL_FORMAT_A29_B8_C8    (3<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8	    (4<<2)
+#define  GEN7_OACONTROL_FORMAT_A45_B8_C8    (5<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8_A16    (6<<2)
+#define  GEN7_OACONTROL_FORMAT_C4_B8	    (7<<2)
+#define  GEN7_OACONTROL_FORMAT_SHIFT	    2
+#define  GEN7_OACONTROL_PER_CTX_ENABLE	    (1<<1)
+#define  GEN7_OACONTROL_ENABLE		    (1<<0)
+
+#define OASTARTTRIG5 0x02720
+#define  OASTARTTRIG5_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG6 0x02724
+#define  OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE (1<<23)
+#define  OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OASTARTTRIG1 0x02710
+#define  OASTARTTRIG1_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG2 0x02714
+#define  OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE (1<<23)
+#define  OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OACEC0_0 0x2770
+#define  OACEC0_0_B0_COMPARE_ANY_EQUAL		0
+#define  OACEC0_0_B0_COMPARE_OR			0
+#define  OACEC0_0_B0_COMPARE_GREATER_THAN	1
+#define  OACEC0_0_B0_COMPARE_EQUAL		2
+#define  OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL	3
+#define  OACEC0_0_B0_COMPARE_LESS_THAN		4
+#define  OACEC0_0_B0_COMPARE_NOT_EQUAL		5
+#define  OACEC0_0_B0_COMPARE_LESS_OR_EQUAL	6
+#define  OACEC0_0_B0_COMPARE_VALUE_MASK		0xffff
+#define  OACEC0_0_B0_COMPARE_VALUE_SHIFT	3
+
+#define OACEC0_1 0x2774
+
+#define GEN7_OABUFFER 0x23B0 /* R/W */
+#define  GEN7_OABUFFER_OVERRUN_DISABLE	    (1<<3)
+#define  GEN7_OABUFFER_EDGE_TRIGGER	    (1<<2)
+#define  GEN7_OABUFFER_STOP_RESUME_ENABLE   (1<<1)
+#define  GEN7_OABUFFER_RESUME		    (1<<0)
+
+#define GEN7_OASTATUS1 0x2364
+#define  GEN7_OASTATUS1_TAIL_MASK	    0xffffffc0
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_128K  (0<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_256K  (1<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_512K  (2<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_1M    (3<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_2M    (4<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_4M    (5<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_8M    (6<<3)
+#define  GEN7_OASTATUS1_OABUFFER_SIZE_16M   (7<<3)
+#define  GEN7_OASTATUS1_COUNTER_OVERFLOW    (1<<2)
+#define  GEN7_OASTATUS1_OABUFFER_OVERFLOW   (1<<1)
+#define  GEN7_OASTATUS1_REPORT_LOST	    (1<<0)
+
+#define GEN7_OASTATUS2 0x2368
+#define GEN7_OASTATUS2_HEAD_MASK    0xffffffc0
+#define GEN7_OASTATUS2_GGTT	    0x1
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
@@ -6545,6 +6612,7 @@ enum skl_disp_power_wells {
 # define GEN6_RCCUNIT_CLOCK_GATE_DISABLE		(1 << 11)
 
 #define GEN6_UCGCTL3				0x9408
+# define GEN6_OACSUNIT_CLOCK_GATE_DISABLE		(1 << 20)
 
 #define GEN7_UCGCTL4				0x940c
 #define  GEN7_L3BANK2X_CLOCK_GATE_DISABLE	(1<<25)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4851d66..f78f232 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -58,6 +58,35 @@
 #define I915_ERROR_UEVENT		"ERROR"
 #define I915_RESET_UEVENT		"RESET"
 
+/**
+ * DOC: perf events configuration exposed by i915 through /sys/bus/event_sources/drivers/i915_oa
+ *
+ */
+
+#define I915_OA_FORMAT_A13_HSW		0
+#define I915_OA_FORMAT_A29_HSW		1
+#define I915_OA_FORMAT_A13_B8_C8_HSW	2
+#define I915_OA_FORMAT_B4_C8_HSW	4
+#define I915_OA_FORMAT_A45_B8_C8_HSW	5
+#define I915_OA_FORMAT_B4_C8_A16_HSW	6
+#define I915_OA_FORMAT_C4_B8_HSW	7
+
+#define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
+
+typedef struct _drm_i915_oa_attr {
+	__u32 size;
+
+	__u32 format;
+	__u32 metrics_set;
+	__u32 timer_exponent;
+
+	__u32 drm_fd;
+	__u32 ctx_id;
+
+	__u64 single_context : 1,
+	      __reserved_1 : 63;
+} drm_i915_oa_attr_t;
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
2.4.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH] squash: be more careful stopping oacontrol updates
  2015-05-18 16:36     ` Robert Bragg
  2015-05-18 17:17       ` [RFC PATCH v2] " Robert Bragg
@ 2015-05-18 17:21       ` Robert Bragg
  1 sibling, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-18 17:21 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api, Chris Wilson

This makes sure we've stopped touching oacontrol before we start
resetting OA, PM and clock gating state. It shouldn't strictly be needed,
since we know oacontrol will have been disabled before we start
destroying an event, but it seems worth highlighting that
update_oacontrol() could still be running asynchronously, and worth
stopping it early in case that ever becomes an issue.
---
 drivers/gpu/drm/i915/i915_oa_perf.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index bf1c1d6..e47ed90 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -166,12 +166,21 @@ static void i915_oa_event_destroy(struct perf_event *event)
 {
 	struct drm_i915_private *i915 =
 		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+	unsigned long lock_flags;
 
 	WARN_ON(event->parent);
 
-	oa_buffer_destroy(i915);
-
+	/* Stop updating oacontrol via _oa_context_pin_[un]notify()... */
+	spin_lock_irqsave(&i915->oa_pmu.lock, lock_flags);
 	i915->oa_pmu.specific_ctx = NULL;
+	spin_unlock_irqrestore(&i915->oa_pmu.lock, lock_flags);
+
+	/* Don't let the compiler start resetting OA, PM and clock gating
+	 * state before we've stopped update_oacontrol()
+	 */
+	barrier();
+
+	oa_buffer_destroy(i915);
 
 	BUG_ON(i915->oa_pmu.exclusive_event != event);
 	i915->oa_pmu.exclusive_event = NULL;
@@ -513,6 +522,11 @@ static int i915_oa_event_init(struct perf_event *event)
 	return 0;
 }
 
+/* Note: Although pmu methods are called with the corresponding
+ * perf_event_context lock taken (so we don't need to worry about our pmu
+ * methods contending with each other) update_oacontrol() may be called
+ * asynchronously via the i915_oa_pmu_[un]register() hooks.
+ */
 static void update_oacontrol(struct drm_i915_private *dev_priv)
 {
 	BUG_ON(!spin_is_locked(&dev_priv->oa_pmu.lock));
-- 
2.4.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH v2] perf: Add PERF_EVENT_IOC_FLUSH ioctl
  2015-05-07 14:20   ` [Intel-gfx] " Chris Wilson
@ 2015-05-18 17:25     ` Robert Bragg
  2015-05-20 12:12       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Bragg @ 2015-05-18 17:25 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api, Chris Wilson

To allow for pmus that may have internal buffering (e.g. the hardware
itself writes out data to its own circular buffer which is only
periodically forwarded to userspace via perf) this ioctl enables
userspace to explicitly ensure it has received all samples before a
point in time.

v2: return int error status

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h      | 7 +++++++
 include/uapi/linux/perf_event.h | 1 +
 kernel/events/core.c            | 5 +++++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cf1d096..0c591eb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,13 @@ struct pmu {
 	 * Free pmu-private AUX data structures
 	 */
 	void (*free_aux)		(void *aux); /* optional */
+
+	/*
+	 * Flush buffered samples (e.g. for pmu hardware that writes samples
+	 * to some intermediate buffer); userspace may need to explicitly
+	 * ensure such samples have been forwarded to perf.
+	 */
+	int (*flush)			(struct perf_event *event); /* optional */
 };
 
 /**
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 309211b..cbf1b80 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -389,6 +389,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF		_IOW('$', 8, __u32)
+#define PERF_EVENT_IOC_FLUSH		_IO ('$', 9)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3fe532a..72daee6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4079,6 +4079,11 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	case PERF_EVENT_IOC_SET_BPF:
 		return perf_event_set_bpf_prog(event, arg);
 
+	case PERF_EVENT_IOC_FLUSH:
+		if (event->pmu->flush)
+			return event->pmu->flush(event);
+		return 0;
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.4.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-08 16:21 ` [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Peter Zijlstra
@ 2015-05-18 17:29   ` Robert Bragg
  0 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-18 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Fri, May 8, 2015 at 5:21 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I've not yet went through the entire series; but I'm wondering if its
> at all possible to re-use some of this work:
>
> lkml.kernel.org/r/1428453299-19121-1-git-send-email-sukadev@linux.vnet.ibm.com
>
> That's for a Power8 HV call that can basically return an array of
> values; which on a superficial level sounds a bit like what this GPU
> hardware does.

Thanks for this pointer.

I think the main similarity here is the ability to capture multiple
counters consistent for the same point in time, but in our case we
don't have an explicitly controlled transaction mechanism like this.

Although we can collect a large set of counters in a latched fashion -
so they are self consistent - the selection of counters included in
our OA unit reports is more rigid.

Most of our counters aren't independently aggregated, they are derived
from signals selected as part of the OA unit configuration and the
values are only maintained by the OA unit itself, so a
re-configuration to select different signals will discard the counter
values of currently selected signals.

afaik re-configuring our signal selection is also relatively slow
(I was told this last week at least, but I haven't tested it myself)
and so it's really geared towards applications or tools choosing a
configuration to maintain for a relatively long time while profiling a
workload.

I think the other big difference here is that we don't have a way to
explicitly trigger a report to be written from the cpu. (Although we
can read the OA counters via mmio, it's only intended for debug
purposes as this subverts the hw latching of counters) This means it
would be difficult to try and treat this like a transaction including
a fixed set of event->read()s without a way for pmu->commit_txn() to
trigger a report.

>
> Let me read more of this..

Thanks.

- Robert

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-15  1:07   ` Robert Bragg
@ 2015-05-19 14:53     ` Peter Zijlstra
  2015-05-20 23:17       ` Robert Bragg
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-19 14:53 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >
> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> bitfields into the perf_event_attr config members, I'm now
> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> config member that's extensible and validated in the same way as the
> >> perf_event_attr struct. I've found this much nicer to work with while
> >> being neatly extensible too.
> >
> > This worries me a bit.. is there more background for this?
> 
> Would it maybe be helpful to see the before and after? I had kept this
> uapi change in a separate patch for a while locally but in the end
> decided to squash it before sending out my updated series.
> 
> Although I did find it a bit awkward with the bitfields, I was mainly
> concerned about the extensibility of packing logically separate
> attributes into the config members and had heard similar concerns from
> a few others who had been experimenting with my patches too.
> 
> A few simple attributes I can think of a.t.m that we might want to add
> in the future are:
> - control of the OABUFFER size
> - a way to ask the kernel to collect reports at the beginning and end
> of batch buffers, in addition to periodic reports
> - alternative ways to uniquely identify a context to support tools
> profiling a single context not necessarily owned by the current
> process
> 
> It could also be interesting to expose some counter configuration
> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> EU' counters included in the OA unit reports, each with a 16bit
> configuration.
> 
> In a more extreme case it might also be useful to allow userspace to
> specify a complete counter config, which (depending on the
> configuration) could be over 100 32bit values to select the counter
> signals + configure the corresponding combining logic.
> 
> Since this pmu is in a device driver it also seemed reasonably
> appropriate to de-couple it slightly from the core perf_event_attr
> structure by allowing driver extensible attributes.
> 
> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> were instead a re-usable utility perhaps maintained in events/core.c,
> so if other pmu drivers were to follow suit there would be less risk
> of a mistake being made here?


So I had a peek at:

  https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf

In an attempt to inform myself of how the hardware worked. But the
document is rather sparse (and does not include broadwell).

So from what I can gather there's two ways to observe the counters,
through MMIO or through the ring-buffer, which in turn seems triggered by
a MI_REPORT_PERF_COUNT command.

[ Now the document talks about shortcomings of this scheme, where the
MI_REPORT_PERF_COUNT command can only be placed every other command, but
a single command can contain so much computation that this is not fine
grained enough -- leading to the suggestion of a timer/cycle based
reporting, but that is not actually mentioned afaict ]

Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
which events it will write out.

This covers the regular 'A' counters. Is this correct?

Then there are the 'B' counters, which appear to be programmable through
the CEC MMIO registers.

These B events can also be part of the MI_REPORT_PERF_COUNT vector.

Right?


So for me the 'natural' way to represent this in perf would be through
event groups. Create a perf event for every single event -- yes this is
53 events.

Use the MMIO reads for the regular read() interface, and use a hrtimer
placing MI_REPORT_PERF_COUNT commands, with a counter select mask
covering all the events in the current group, for sampling.

Then obtain the vector values using PERF_SAMPLE_READ and
PERF_FORMAT_READ -- and use the read txn support to consume the
ring-buffer vectors instead of the MMIO registers.

You can use the perf_event_attr::config to select the counter (A0-A44,
B0-B7) and use perf_event_attr::config1 (low and high dword) for the
corresponding CEC registers.


This does not require random per driver ABI extensions for
perf_event_attr, nor your custom output format.

Am I missing something obvious here?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v2] perf: Add PERF_EVENT_IOC_FLUSH ioctl
  2015-05-18 17:25     ` [RFC PATCH v2] " Robert Bragg
@ 2015-05-20 12:12       ` Ingo Molnar
  2015-05-21 17:40         ` [RFC PATCH] perf: enable fsync to flush buffered samples Robert Bragg
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2015-05-20 12:12 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, dri-devel, linux-api,
	Chris Wilson


* Robert Bragg <robert@sixbynine.org> wrote:

> To allow for pmus that may have internal buffering (e.g. the hardware
> itself writes out data to its own circular buffer which is only
> periodically forwarded to userspace via perf) this ioctl enables
> userspace to explicitly ensure it has received all samples before a
> point in time.
> 
> v2: return int error status
> 
> Signed-off-by: Robert Bragg <robert@sixbynine.org>
> ---
>  include/linux/perf_event.h      | 7 +++++++
>  include/uapi/linux/perf_event.h | 1 +
>  kernel/events/core.c            | 5 +++++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index cf1d096..0c591eb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -305,6 +305,13 @@ struct pmu {
>  	 * Free pmu-private AUX data structures
>  	 */
>  	void (*free_aux)		(void *aux); /* optional */
> +
> +	/*
> +	 * Flush buffered samples (E.g. for pmu hardware that writes samples to
> +	 * some intermediate buffer) userspace may need to explicitly ensure
> +	 * such samples have been forwarded to perf.
> +	 */
> +	int (*flush)			(struct perf_event *event); /*optional */
>  };
>  
>  /**
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 309211b..cbf1b80 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -389,6 +389,7 @@ struct perf_event_attr {
>  #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
>  #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
>  #define PERF_EVENT_IOC_SET_BPF		_IOW('$', 8, __u32)
> +#define PERF_EVENT_IOC_FLUSH		_IO ('$', 9)
>  
>  enum perf_event_ioc_flags {
>  	PERF_IOC_FLAG_GROUP		= 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3fe532a..72daee6 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4079,6 +4079,11 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
>  	case PERF_EVENT_IOC_SET_BPF:
>  		return perf_event_set_bpf_prog(event, arg);
>  
> +	case PERF_EVENT_IOC_FLUSH:
> +		if (event->pmu->flush)
> +			return event->pmu->flush(event);
> +		return 0;
> +
>  	default:
>  		return -ENOTTY;
>  	}

So 'struct file_operations' has a callback for:

        int (*fsync) (struct file *, loff_t, loff_t, int datasync);

Could we use that perhaps, instead of an ioctl()? Not sure how it all 
integrates with the VFS though.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-19 14:53     ` Peter Zijlstra
@ 2015-05-20 23:17       ` Robert Bragg
  2015-05-21  8:24         ` [Intel-gfx] " Daniel Vetter
  2015-05-27 15:39         ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-20 23:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
>> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
>> >
>> >> I've changed the uapi for configuring the i915_oa specific attributes
>> >> when calling perf_event_open(2) whereby instead of cramming lots of
>> >> bitfields into the perf_event_attr config members, I'm now
>> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
>> >> config member that's extensible and validated in the same way as the
>> >> perf_event_attr struct. I've found this much nicer to work with while
>> >> being neatly extensible too.
>> >
>> > This worries me a bit.. is there more background for this?
>>
>> Would it maybe be helpful to see the before and after? I had kept this
>> uapi change in a separate patch for a while locally but in the end
>> decided to squash it before sending out my updated series.
>>
>> Although I did find it a bit awkward with the bitfields, I was mainly
>> concerned about the extensibility of packing logically separate
>> attributes into the config members and had heard similar concerns from
>> a few others who had been experimenting with my patches too.
>>
>> A few simple attributes I can think of a.t.m that we might want to add
>> in the future are:
>> - control of the OABUFFER size
>> - a way to ask the kernel to collect reports at the beginning and end
>> of batch buffers, in addition to periodic reports
>> - alternative ways to uniquely identify a context to support tools
>> profiling a single context not necessarily owned by the current
>> process
>>
>> It could also be interesting to expose some counter configuration
>> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
>> EU' counters included in the OA unit reports, each with a 16bit
>> configuration.
>>
>> In a more extreme case it might also be useful to allow userspace to
>> specify a complete counter config, which (depending on the
>> configuration) could be over 100 32bit values to select the counter
>> signals + configure the corresponding combining logic.
>>
>> Since this pmu is in a device driver it also seemed reasonably
>> appropriate to de-couple it slightly from the core perf_event_attr
>> structure by allowing driver extensible attributes.
>>
>> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
>> were instead a re-usable utility perhaps maintained in events/core.c,
>> so if other pmu drivers were to follow suit there would be less risk
>> of a mistake being made here?
>
>
> So I had a peek at:
>
>   https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
>
> In an attempt to inform myself of how the hardware worked. But the
> document is rather sparse (and does not include broadwell).

Thanks. Sorry I can't easily fix this myself, but there is work
ongoing to update this documentation. In the interim I can try and
cover some of the key details here...

>
> So from what I can gather there's two ways to observe the counters,
> through MMIO or through the ring-buffer, which in turn seems triggered by
> a MI_REPORT_PERF_COUNT command.

There are three ways: mmio, MI_REPORT_PERF_COUNT via the ring-buffer
and periodic sampling where the HW writes into a circular 'oabuffer'.

I think it's best to discount mmio as a debug feature since none of
the counters are useful in isolation and mmio subverts the latching
mechanism we get with the other two methods. We typically at least
want to relate a counter to the number of clock cycles elapsed or the
gpu time, or thread spawn counts.

This pmu driver is primarily for exposing periodic metrics, while
it's up to applications to choose to emit MI_REPORT_PERF_COUNT
commands as part of their command streams so I think we can mostly
ignore MI_REPORT_PERF_COUNT here too.

>
> [ Now the document talks about shortcomings of this scheme, where the
> MI_REPORT_PERF_COUNT command can only be placed every other command, but
> a single command can contain so much computation that this is not fine
> grained enough -- leading to the suggestion of a timer/cycle based
> reporting, but that is not actually mentioned afaict ]

It's in there, though unfortunately the documentation isn't very clear
currently. The 'Performance Event Counting' section seems to be the
appropriate place to introduce the periodic sampling feature, but just
reading it myself it really only talks about the limitations of
reporting like you say. I'll see if I can prod to get this improved.

If you see page 18 "Performance Statistics Registers":

OACONTROL has a 'Timer Period' field and 'Timer Enable'
OABUFFER points to a circular buffer for periodic reports
OASTATUS1/2 contain the head/tail pointers


>
> Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
> which events it will write out.
>
> This covers the regular 'A' counters. Is this correct?

There isn't a counter select in MI_REPORT_PERF_COUNT commands. I guess
you either saw 'GTT select' which is for dealing with the different
address spaces that the destination buffer can reside in, or in the
overview you saw the mention of the 'Counter Select' field following
the description of MI_REPORT_PERF_COUNT but just missed that it was
referring to the OACONTROL register.

The counter configuration is shared by all three modes of reading the
counters. (Though beware of confusing terms here, imho the above
mentioned 'Counter Select' would be better described as a 'Report
Format' which is only relevant to MI_REPORT_PERF_COUNT and periodic
sampling, not mmio reading)

The A counters have fixed semantics on Haswell so there's nothing to
configure with these. On Broadwell some of these do become
configurable as "Flexible EU Counters" which is something for us to
keep in mind for later.

The semantics of the B counters are determined through the
configuration of a MUX to select signals of interest and the
configuration of some fixed-function ('boolean') logic that can affect
how those signals are combined before feeding into one of the B
counters. This configuration process is relatively slow, involving on
the order of 100 register writes.

After configuring the counter semantics, then for periodic and
MI_REPORT_PERF_COUNT usage (which work in terms of reports), we have
to choose the layout of those reports.

OACONTROL has a report format field ('Counter Select') that gives us a
limited way to trade off how many A or B counters are included, and
the memory bandwidth consumed writing out reports. So it doesn't
affect the configuration of counters per-se, it's just a way to ignore
some of the currently configured counters by selecting smaller
reports. Just looking at the OACONTROL documentation though I see
something amiss, as this field isn't only pre-Haswell. The possible
formats/layouts for Haswell are documented on page 10.


>
> Then there are the 'B' counters, which appear to be programmable through
> the CEC MMIO registers.

The fixed-function logic affecting B counters is configured via those
CEC registers as well as some per-counter 'report trigger' and 'start
trigger' registers not currently described. Overall though programming
the B counters is a combination of the fixed-function logic and the
MUX configuration that determines the signals that feed into this
logic, before counting.

>
> These B events can also be part of the MI_REPORT_PERF_COUNT vector.
>
> Right?

Right, part of MI_REPORT_PERF_COUNT and periodic reports.

>
>
> So for me the 'natural' way to represent this in perf would be through
> event groups. Create a perf event for every single event -- yes this is
> 53 events.

So when I was first looking at this work I had considered the
possibility of separate events, and these are some of the things that
in the end made me forward the hardware's raw report data via a single
event instead...

There are 100s of possible B counters depending on the MUX
configuration + fixed-function logic in addition to the A counters. A
choice would need to be made about whether to expose events for the
configurable counters that aren't inherently associated with any
semantics, or instead defining events for counters with specific
semantics (with 100s of possible counters to define). The latter would
seem more complex for userspace and the kernel if they both now have
to understand the constraints on what counters can be used together. I
guess with either approach we would also need to have some form of
dedicated group leader event accepting attributes for configuring the
state that affects the group as a whole, such as the counter
configuration (3D vs GPGPU vs media etc). I'm not sure where we would
handle the context-id + drm file descriptor attributes for initiating
single context profiling but guess we'd need to authenticate each
individual event open. It's not clear if we'd configure the report
layout via the group leader, or try to automatically choose the most
compact format based on the group members. I'm not sure how pmus
currently handle the opening of enabled events on an enabled group but
I think there would need to be limitations in our case that new
members can't result in a reconfigure of the counters if that might
lose the current counter values known to userspace.

From a user's pov, there's no real freedom to mix and match which
counters are configured together, and there's only some limited
ability to ignore some of the currently selected counters by not
including them in reports.

Something to understand here is that we have to work with sets of
pre-defined MUX + fixed-function logic configurations that have been
validated to give useful metrics for specific use cases, such as
benchmarking 3D rendering, GPGPU or media workloads.

As it is currently the kernel doesn't need to know anything about the
semantics of individual counters being selected, so it's currently
convenient that we can aim to maintain all the counter meta data we
have in userspace according to the changing needs of tools or drivers
(e.g. names, descriptions, units, max values, normalization
equations), de-coupled from the kernel, instead of splitting it
between the kernel and userspace.

A benefit of being able to change the report size is to reduce memory
bandwidth usage that can skew measurements. It's possible to request
the gpu to write out periodic snapshots at a very high frequency (we
can program a period as low as 160 nanoseconds) and higher frequencies
can start to expose some interesting details about how the gpu is
utilized - though with notable observer effects too. How careful we
are to not waste bandwidth is expected to determine what sampling
resolutions we can achieve before significantly impacting what we are
measuring.

Splitting the counters up looked like it could increase the bandwidth
we use quite a bit. The main difference comes from requiring 64bit
values instead of the 32bit values in our raw reports. This can be
offset partly since there are quite a few 'reserved'/redundant A
counters that don't need forwarding. As an example in the most extreme
case, instead of an 8 byte perf_event_header + 4byte raw_size + 256
byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
A counters (ignoring redundant ones) + 16 configurable counters = 400
byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
could choose to forward only 2 or 3 counters of interest at these high
frequencies which isn't possible currently.

>
> Use the MMIO reads for the regular read() interface, and use a hrtimer
> placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> covering all the events in the current group, for sampling.

Unfortunately due to the mmio limitations and the need to relate
counters I can't imagine many use cases for directly accessing the
counters individually via the read() interface.

MI_REPORT_PERF_COUNT commands are really only intended for collecting
reports in sync with a command stream. We are experimenting currently
with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
commands automatically around the batches of commands submitted by
userspace so we can do a better job of filtering metrics across many
gpu contexts, but for now the expectation is that the kernel shouldn't
be emitting MI_REPORT_PERF_COUNT commands. We emit
MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
GL_INTEL_performance_query extension, at the start and end of a query
around a sequence of commands that the application is interested in
measuring.

>
> Then obtain the vector values using PERF_SAMPLE_READ and
> PERF_FORMAT_READ -- and use the read txn support to consume the
> ring-buffer vectors instead of the MMIO registers.

It sounds potentially doable to consume periodic OABUFFER reports via
the read_txn support.

>
> You can use the perf_event_attr::config to select the counter (A0-A44,
> B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> corresponding CEC registers.
>

Hopefully covered above, but since the fixed-function state is so
dependent on the MUX configuration I think it currently makes sense to
treat the MUX plus logic state (including the CEC state) a tightly
coupled unit.

The Flexible EU Counters for Broadwell+ could be more amenable to this
kind of independent configuration, as I don't believe they are
dependent on the MUX configuration.

One idea that's come up a lot though is having the possibility of
being able to configure an event with a full MUX + fixed-function
state description.

>
> This does not require random per driver ABI extensions for
> perf_event_attr, nor your custom output format.
>
> Am I missing something obvious here?

Definitely nothing 'obvious' since the current documentation is
notably incomplete a.t.m, but I don't think we were on the same page
about how the hardware works and our use cases.

Hopefully some of my above comments help clarify some details.

I think I'll hold off from looking at changing anything specific until
you've had a chance to read my comments above in case you have more
thoughts and feedback.


Thanks for looking at this,
- Robert

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-20 23:17       ` Robert Bragg
@ 2015-05-21  8:24         ` Daniel Vetter
  2015-05-27 15:39         ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2015-05-21  8:24 UTC (permalink / raw)
  To: Robert Bragg
  Cc: Peter Zijlstra, dri-devel, David Airlie, linux-api, intel-gfx,
	linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Daniel Vetter

On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> On Tue, May 19, 2015 at 3:53 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> >> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >> >
> >> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> >> bitfields into the perf_event_attr config members, I'm now
> >> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> >> config member that's extensible and validated in the same way as the
> >> >> perf_event_attr struct. I've found this much nicer to work with while
> >> >> being neatly extensible too.
> >> >
> >> > This worries me a bit.. is there more background for this?
> >>
> >> Would it maybe be helpful to see the before and after? I had kept this
> >> uapi change in a separate patch for a while locally but in the end
> >> decided to squash it before sending out my updated series.
> >>
> >> Although I did find it a bit awkward with the bitfields, I was mainly
> >> concerned about the extensibility of packing logically separate
> >> attributes into the config members and had heard similar concerns from
> >> a few others who had been experimenting with my patches too.
> >>
> >> A few simple attributes I can think of a.t.m that we might want to add
> >> in the future are:
> >> - control of the OABUFFER size
> >> - a way to ask the kernel to collect reports at the beginning and end
> >> of batch buffers, in addition to periodic reports
> >> - alternative ways to uniquely identify a context to support tools
> >> profiling a single context not necessarily owned by the current
> >> process
> >>
> >> It could also be interesting to expose some counter configuration
> >> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> >> EU' counters included in the OA unit reports, each with a 16bit
> >> configuration.
> >>
> >> In a more extreme case it might also be useful to allow userspace to
> >> specify a complete counter config, which (depending on the
> >> configuration) could be over 100 32bit values to select the counter
> >> signals + configure the corresponding combining logic.
> >>
> >> Since this pmu is in a device driver it also seemed reasonably
> >> appropriate to de-couple it slightly from the core perf_event_attr
> >> structure by allowing driver extensible attributes.
> >>
> >> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> >> were instead a re-usable utility perhaps maintained in events/core.c,
> >> so if other pmu drivers were to follow suit there would be less risk
> >> of a mistake being made here?
> >
> >
> > So I had a peek at:
> >
> >   https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
> >
> > In an attempt to inform myself of how the hardware worked. But the
> > document is rather sparse (and does not include broadwell).
> 
> Thanks. Sorry I can't easily fix this myself, but there is work
> ongoing to update this documentation. In the interim I can try and
> cover some of the key details here...

Yeah the docs are decidedly less than stellar and especially lack
examples for how to use the flexible counters properly :( A few additional
comments from me below.

> > So from what I can gather there's two ways to observe the counters,
> > through MMIO or through the ring-buffer, which in turn seems triggered by
> > a MI_REPORT_PERF_COUNT command.
> 
> There are three ways: mmio, MI_REPORT_PERF_COUNT via the ring-buffer
> and periodic sampling where the HW writes into a circular 'oabuffer'.
> 
> I think it's best to discount mmio as a debug feature since none of
> the counters are useful in isolation and mmio subverts the latching
> mechanism we get with the other two methods. We typically at least
> want to relate a counter to the number of clock cycles elapsed or the
> gpu time, or thread spawn counts.
> 
> This pmu driver is primarily for exposing periodic metrics, while
> it's up to applications to choose to emit MI_REPORT_PERF_COUNT
> commands as part of their command streams so I think we can mostly
> ignore MI_REPORT_PERF_COUNT here too.

Yeah this is a crucial bit here - userspace inserts MI_REPORT_PERF_COUNT
into the gpu batchbuffer. And the gpu stores the samples at the userspace
provided address in the per-process gpu address space. Which means that
the kernel never ever sees these samples, nor can it get (efficiently at
least) at the target memory.

The mid-batchbuffer placement of perf samples is important since wrapping
just the entire batch is way too coarse. And timer-based sampling isn't
aligned to the different cmds, which means you can't restrict perf sampling
to e.g. only a specific shader. Right now the gpu doesn't provide any
means to restrict sampling except emitting a full pipeline flush and then
wrapping the interesting draw commands with MI_REPORT_PERF.

> > [ Now the document talks about shortcomings of this scheme, where the
> > MI_REPORT_PERF_COUNT command can only be placed every other command, but
> > a single command can contain so much computation that this is not fine
> > grained enough -- leading to the suggestion of a timer/cycle based
> > reporting, but that is not actually mentioned afaict ]
> 
> It's in there, though unfortunately the documentation isn't very clear
> currently. The 'Performance Event Counting' section seems to be the
> appropriate place to introduce the periodic sampling feature, but just
> reading it myself it really only talks about the limitations of
> reporting like you say. I'll see if I can prod to get this improved.
> 
> If you see page 18 "Performance Statistics Registers":
> 
> OACONTROL has a 'Timer Period' field and 'Timer Enable'
> OABUFFER points to a circular buffer for periodic reports
> OASTATUS1/2 contain the head/tail pointers
> 
> 
> >
> > Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
> > which events it will write out.
> >
> > This covers the regular 'A' counters. Is this correct?
> 
> There isn't a counter select in MI_REPORT_PERF_COUNT commands. I guess
> you either saw 'GTT select' which is for dealing with the different
> address spaces that the destination buffer can reside in, or in the
> overview you saw the mention of the 'Counter Select' field following
> the description of MI_REPORT_PERF_COUNT but just missed that it was
> referring to the OACONTROL register.

Clarification: For userspace-inserted MI_REPORT_PERF the gpu forces the
access to the per-process address space, i.e. the GTT select bit isn't
obeyed.

> The counter configuration is shared by all three modes of reading the
> counters. (Though beware of confusing terms here, imho the above
> mentioned 'Counter Select' would be better described as a 'Report
> Format' which is only relevant to MI_REPORT_PERF_COUNT and periodic
> sampling, not mmio reading)
> 
> The A counters have fixed semantics on Haswell so there's nothing to
> configure with these. On Broadwell some of these do become
> configurable as "Flexible EU Counters" which is something for us to
> keep in mind for later.
> 
> The semantics of the B counters are determined through the
> configuration of a MUX to select signals of interest and the
> configuration of some fixed-function ('boolean') logic that can affect
> how those signals are combined before feeding into one of the B
> counters. This configuration process is relatively slow, involving on
> the order of 100 register writes.
> 
> After configuring the counter semantics, then for periodic and
> MI_REPORT_PERF_COUNT usage (which work in terms of reports), we have
> to choose the layout of those reports.
> 
> OACONTROL has a report format field ('Counter Select') that gives us a
> limited way to trade off how many A or B counters are included, and
> the memory bandwidth consumed writing out reports. So it doesn't
> affect the configuration of counters per-se, it's just a way to ignore
> some of the currently configured counters by selecting smaller
> reports. Just looking at the OACONTROL documentation though I see
> something amiss, as this field isn't only pre-Haswell. The possible
> formats/layouts for Haswell are documented on page 10.

The tricky bit to keep in mind is that userspace and the kernel need to be
somewhat in sync about which sampling format is selected. Otherwise the
address in the MI_REPORT_PERF generated by userspace will not increment by
the correct amount (there's unfortunately no nice autoincrement support
for this).

> > Then there are the 'B' counters, which appear to be programmable through
> > the CEC MMIO registers.
> 
> The fixed-function logic affecting B counters is configured via those
> CEC registers as well as some per-counter 'report trigger' and 'start
> trigger' registers not currently described. Overall though programming
> the B counters is a combination of the fixed-function logic and the
> MUX configuration that determines the signals that feed into this
> logic, before counting.
> 
> >
> > These B events can also be part of the MI_REPORT_PERF_COUNT vector.
> >
> > Right?
> 
> Right, part of MI_REPORT_PERF_COUNT and periodic reports.

Imo that's a big reason for why the kernel should just forward the raw
samples to userspace. If we try to decode them in the kernel for the
timer-based ring, then userspace needs a dupe of that logic for the
MI_REPORT_PERF samples it collects.

If otoh we just forward the raw samples to userspace the same userspace
code can do the decoding for both the timer-based samples (collected
through perf) and the in-batch samples.

> > So for me the 'natural' way to represent this in perf would be through
> > event groups. Create a perf event for every single event -- yes this is
> > 53 events.
> 
> So when I was first looking at this work I had considered the
> possibility of separate events, and these are some of the things that
> in the end made me forward the hardware's raw report data via a single
> event instead...
> 
> There are 100s of possible B counters depending on the MUX
> configuration + fixed-function logic in addition to the A counters. A
> choice would need to be made about whether to expose events for the
> configurable counters that aren't inherently associated with any
> semantics, or instead defining events for counters with specific
> semantics (with 100s of possible counters to define). The latter would
> seem more complex for userspace and the kernel if they both now have
> to understand the constraints on what counters can be used together. I
> guess with either approach we would also need to have some form of
> dedicated group leader event accepting attributes for configuring the
> state that affects the group as a whole, such as the counter
> configuration (3D vs GPGPU vs media etc). I'm not sure where we would
> handle the context-id + drm file descriptor attributes for initiating
> single context profiling but guess we'd need to authenticate each
> individual event open. It's not clear if we'd configure the report
> layout via the group leader, or try to automatically choose the most
> compact format based on the group members. I'm not sure how pmus
> currently handle the opening of enabled events on an enabled group but
> I think there would need to be limitations in our case that new
> members can't result in a reconfigure of the counters if that might
> lose the current counter values known to userspace.
> 
> From a user's pov, there's no real freedom to mix and match which
> counters are configured together, and there's only some limited
> ability to ignore some of the currently selected counters by not
> including them in reports.
> 
> Something to understand here is that we have to work with sets of
> pre-defined MUX + fixed-function logic configurations that have been
> validated to give useful metrics for specific use cases, such as
> benchmarking 3D rendering, GPGPU or media workloads.
> 
> As it is currently the kernel doesn't need to know anything about the
> semantics of individual counters being selected, so it's currently
> convenient that we can aim to maintain all the counter meta data we
> have in userspace according to the changing needs of tools or drivers
> (e.g. names, descriptions, units, max values, normalization
> equations), de-coupled from the kernel, instead of splitting it
> between the kernel and userspace.
> 
> A benefit of being able to change the report size is to reduce memory
> bandwidth usage that can skew measurements. It's possible to request
> the gpu to write out periodic snapshots at a very high frequency (we
> can program a period as low as 160 nanoseconds) and higher frequencies
> can start to expose some interesting details about how the gpu is
> utilized - though with notable observer effects too. How careful we
> are to not waste bandwidth is expected to determine what sampling
> resolutions we can achieve before significantly impacting what we are
> measuring.
> 
> Splitting the counters up looked like it could increase the bandwidth
> we use quite a bit. The main difference comes from requiring 64bit
> values instead of the 32bit values in our raw reports. This can be
> offset partly since there are quite a few 'reserved'/redundant A
> counters that don't need forwarding. As an example in the most extreme
> case, instead of an 8 byte perf_event_header + 4 byte raw_size + 256
> byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
> A counters (ignoring redundant ones) + 16 configurable counters = 400
> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
> could choose to forward only 2 or 3 counters of interest at these high
> frequencies which isn't possible currently.
> 
> >
> > Use the MMIO reads for the regular read() interface, and use a hrtimer
> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> > covering all the events in the current group, for sampling.
> 
> Unfortunately due to the mmio limitations and the need to relate
> counters I can't imagine many use cases for directly accessing the
> counters individually via the read() interface.
> 
> MI_REPORT_PERF_COUNT commands are really only intended for collecting
> reports in sync with a command stream. We are experimenting currently
> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
> commands automatically around the batches of commands submitted by
> userspace so we can do a better job of filtering metrics across many
> gpu contexts, but for now the expectation is that the kernel shouldn't
> be emitting MI_REPORT_PERF_COUNT commands. We emit
> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
> GL_INTEL_performance_query extension, at the start and end of a query
> around a sequence of commands that the application is interested in
> measuring.
> 
> >
> > Then obtain the vector values using PERF_SAMPLE_READ and
> > PERF_FORMAT_READ -- and use the read txn support to consume the
> > ring-buffer vectors instead of the MMIO registers.
> 
> It sounds potentially doable to consume periodic OABUFFER reports via
> the read_txn support.
> 
> >
> > You can use the perf_event_attr::config to select the counter (A0-A44,
> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> > corresponding CEC registers.
> >
> 
> Hopefully covered above, but since the fixed-function state is so
> dependent on the MUX configuration I think it currently makes sense to
> treat the MUX plus logic state (including the CEC state) as a tightly
> coupled unit.
> 
> The Flexible EU Counters for Broadwell+ could be more amenable to this
> kind of independent configuration, as I don't believe they are
> dependent on the MUX configuration.
> 
> One idea that's come up a lot though is having the possibility of
> being able to configure an event with a full MUX + fixed-function
> state description.
> 
> >
> > This does not require random per driver ABI extensions for
> > perf_event_attr, nor your custom output format.
> >
> > Am I missing something obvious here?
> 
> Definitely nothing 'obvious' since the current documentation is
> notably incomplete a.t.m, but I don't think we were on the same page
> about how the hardware works and our use cases.
> 
> Hopefully some of my above comments help clarify some details.
> 
> I think I'll hold off from looking at changing anything specific until
> you've had a chance to read my comments above in case you have more
> thoughts and feedback.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH] perf: enable fsync to flush buffered samples
  2015-05-20 12:12       ` Ingo Molnar
@ 2015-05-21 17:40         ` Robert Bragg
  0 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-05-21 17:40 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Jani Nikula, David Airlie, Peter Zijlstra,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api, Chris Wilson

Instead of having a PERF_EVENT_IOC_FLUSH ioctl this instead allows
userspace to use fsync for flushing pmu samples, as suggested by Ingo
Molnar - thanks.

For reference I've also pushed a patch to my Mesa branch to test
this: https://github.com/rib/mesa wip/rib/oa-hsw-4.0.0

- Robert

--- >8 ---

To allow for pmus that may have internal buffering (e.g. the hardware
writes out data to a circular buffer which is only periodically
forwarded to userspace via perf) this enables userspace to explicitly
ensure it has received all samples before a point in time.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h |  7 +++++++
 kernel/events/core.c       | 23 +++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 04e98c8..d7fac05 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -305,6 +305,13 @@ struct pmu {
 	 * Free pmu-private AUX data structures
 	 */
 	void (*free_aux)		(void *aux); /* optional */
+
+	/*
+	 * Flush buffered samples (e.g. for pmu hardware that writes samples to
+	 * some intermediate buffer); userspace may need to explicitly ensure
+	 * such samples have been forwarded to perf.
+	 */
+	int (*flush)			(struct perf_event *event); /* optional */
 };
 
 /**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2ba89a1..a604e0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4728,6 +4728,28 @@ static int perf_fasync(int fd, struct file *filp, int on)
 	return 0;
 }
 
+static int perf_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
+{
+	struct perf_event *event = filp->private_data;
+	struct perf_event_context *ctx;
+	int ret;
+
+	/* We don't have a use for synchronizing a specific range, or datasync,
+	 * but let's not silently ignore them in case we think of uses later...
+	 */
+	if (start != 0 || end != LLONG_MAX || datasync != 0)
+		return -EINVAL;
+
+	if (!event->pmu->flush)
+		return 0;
+
+	ctx = perf_event_ctx_lock(event);
+	ret = event->pmu->flush(event);
+	perf_event_ctx_unlock(event, ctx);
+
+	return ret;
+}
+
 static const struct file_operations perf_fops = {
 	.llseek			= no_llseek,
 	.release		= perf_release,
@@ -4737,6 +4759,7 @@ static const struct file_operations perf_fops = {
 	.compat_ioctl		= perf_compat_ioctl,
 	.mmap			= perf_mmap,
 	.fasync			= perf_fasync,
+	.fsync			= perf_fsync,
 };
 
 /*
-- 
2.4.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-20 23:17       ` Robert Bragg
  2015-05-21  8:24         ` [Intel-gfx] " Daniel Vetter
@ 2015-05-27 15:39         ` Peter Zijlstra
  2015-05-27 16:41           ` Ingo Molnar
  2015-06-04 18:53           ` [Intel-gfx] " Robert Bragg
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-27 15:39 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> >
> > So for me the 'natural' way to represent this in perf would be through
> > event groups. Create a perf event for every single event -- yes this is
> > 53 events.
> 
> So when I was first looking at this work I had considered the
> possibility of separate events, and these are some of the things that
> in the end made me forward the hardware's raw report data via a single
> event instead...
> 
> There are 100s of possible B counters depending on the MUX
> configuration + fixed-function logic in addition to the A counters.

There's only 8 B counters, what there is 100s of is configurations, and
that's no different from any other PMU. There's 100s of events to select
on the CPU side as well - yet only 4 (general purpose) counters (8 if
you disable HT on recent machines) per CPU.

Of course, you could create more than 8 events at any one time, and then
perf will round-robin the configuration for you.

> A
> choice would need to be made about whether to expose events for the
> configurable counters that aren't inherently associated with any
> semantics, or instead defining events for counters with specific
> semantics (with 100s of possible counters to define). The latter would
> seem more complex for userspace and the kernel if they both now have
> to understand the constraints on what counters can be used together.

Again, no different from existing PMUs. The most 'interesting' part of
any PMU driver is event scheduling.

The explicit design of perf was to put event scheduling _in_ the kernel
and not allow userspace direct access to the hardware (as previous PMU
models did).

> I
> guess with either approach we would also need to have some form of
> dedicated group leader event accepting attributes for configuring the
> state that affects the group as a whole, such as the counter
> configuration (3D vs GPGPU vs media etc). 

That depends a bit on how flexible you want to switch between these
modes; with the cost being in the 100s of MMIO register writes, it might
just not make sense to make this too dynamic.

An option might be to select the mode through the PMU driver's sysfs
files. Another option might be to expose the B counters 3 times and have
each set be mutually exclusive. That is, as soon as you've created a 3D
event, you can no longer create GPGPU/Media events.

> I'm not sure where we would
> handle the context-id + drm file descriptor attributes for initiating
> single context profiling but guess we'd need to authenticate each
> individual event open.

Event open is a slow path, so that should not be a problem, right?

> It's not clear if we'd configure the report
> layout via the group leader, or try to automatically choose the most
> compact format based on the group members. I'm not sure how pmus
> currently handle the opening of enabled events on an enabled group but

With the PERF_FORMAT_GROUP layout changing in-flight. I would recommend
not doing that -- decoding the output will be 'interesting' but not
impossible.

> I think there would need to be limitations in our case that new
> members can't result in a reconfigure of the counters if that might
> lose the current counter values known to userspace.

I'm not entirely sure what you mean here.

> From a user's pov, there's no real freedom to mix and match which
> counters are configured together, and there's only some limited
> ability to ignore some of the currently selected counters by not
> including them in reports.

It is 'impossible' to create a group that is not programmable. That is,
the pmu::event_init() call _should_ verify that the addition of the
event to the group (as given by event->group_leader) is valid.

> Something to understand here is that we have to work with sets of
> pre-defined MUX + fixed-function logic configurations that have been
> validated to give useful metrics for specific use cases, such as
> benchmarking 3D rendering, GPGPU or media workloads.

This is fine; as stated above. Since these are limited pieces of
'firmware' which are expensive to load, you don't have to do a fully
dynamic solution here.

> As it is currently the kernel doesn't need to know anything about the
> semantics of individual counters being selected, so it's currently
> convenient that we can aim to maintain all the counter meta data we
> have in userspace according to the changing needs of tools or drivers
> (e.g. names, descriptions, units, max values, normalization
> equations), de-coupled from the kernel, instead of splitting it
> between the kernel and userspace.

And that fully violates the design premise of perf. The kernel should be
in control of resources, not userspace.

If we'd have put userspace in charge, we could now not profile the same
task by two (or more) different observers. But with kernel side counter
management that is no problem at all.

> A benefit of being able to change the report size is to reduce memory
> bandwidth usage that can skew measurements. It's possible to request
> the gpu to write out periodic snapshots at a very high frequency (we
> can program a period as low as 160 nanoseconds) and higher frequencies
> can start to expose some interesting details about how the gpu is
> utilized - though with notable observer effects too. How careful we
> are to not waste bandwidth is expected to determine what sampling
> resolutions we can achieve before significantly impacting what we are
> measuring.
> 
> Splitting the counters up looked like it could increase the bandwidth
> we use quite a bit. The main difference comes from requiring 64bit
> values instead of the 32bit values in our raw reports. This can be
> offset partly since there are quite a few 'reserved'/redundant A
> counters that don't need forwarding. As an example in the most extreme
> case, instead of an 8 byte perf_event_header + 4 byte raw_size + 256
> byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
> A counters (ignoring redundant ones) + 16 configurable counters = 400

In that PDF there's only 8 configurable 'B' counters.

> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
> could choose to forward only 2 or 3 counters of interest at these high
> frequencies which isn't possible currently.

Right; although you'd have to spend some cpu cycles on the format shift.
Which might make it moot again. That said; I would really prefer we
start out with trying to make the generic format stuff work before
trying to come up with special case hacks.

> > Use the MMIO reads for the regular read() interface, and use a hrtimer
> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> > covering all the events in the current group, for sampling.
> 
> Unfortunately due to the mmio limitations and the need to relate
> counters I can't imagine many use cases for directly accessing the
> counters individually via the read() interface.

Fair enough.

> MI_REPORT_PERF_COUNT commands are really only intended for collecting
> reports in sync with a command stream. We are experimenting currently
> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
> commands automatically around the batches of commands submitted by
> userspace so we can do a better job of filtering metrics across many
> gpu contexts, but for now the expectation is that the kernel shouldn't
> be emitting MI_REPORT_PERF_COUNT commands. We emit
> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
> GL_INTEL_performance_query extension, at the start and end of a query
> around a sequence of commands that the application is interested in
> measuring.

I will try and read up on the GL_INTEL_performance_query thing.

> > You can use the perf_event_attr::config to select the counter (A0-A44,
> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> > corresponding CEC registers.
> >
> 
> Hopefully covered above, but since the fixed-function state is so
> dependent on the MUX configuration I think it currently makes sense to
> treat the MUX plus logic state (including the CEC state) as a tightly
> coupled unit.

Oh wait, the A counters are also affected by the MUX programming!
That wasn't clear to me before this.

Same difference though; you could still do either the sysfs aided flips
or the mutually exclusive counter types to deal with this; all still
assuming reprogramming all that is 'expensive'.

> The Flexible EU Counters for Broadwell+ could be more amenable to this
> kind of independent configuration, as I don't believe they are
> dependent on the MUX configuration.

That sounds good :-)

> One idea that's come up a lot though is having the possibility of
> being able to configure an event with a full MUX + fixed-function
> state description.

Right; seeing how there's only a very limited number of these MUX
programs validated (3D, GPGPU, Media, any others?) this should be
doable. And as mentioned before, you could disable the other types once
you create one in order to limit the reprogramming thing.

> >
> > This does not require random per driver ABI extensions for
> > perf_event_attr, nor your custom output format.
> >
> > Am I missing something obvious here?
> 
> Definitely nothing 'obvious' since the current documentation is
> notably incomplete a.t.m, but I don't think we were on the same page
> about how the hardware works and our use cases.
> 
> Hopefully some of my above comments help clarify some details.

Yes, thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-27 15:39         ` Peter Zijlstra
@ 2015-05-27 16:41           ` Ingo Molnar
  2015-06-04 18:53           ` [Intel-gfx] " Robert Bragg
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2015-05-27 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Robert Bragg, intel-gfx, Daniel Vetter, Jani Nikula,
	David Airlie, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, dri-devel, linux-api


* Peter Zijlstra <peterz@infradead.org> wrote:

> > As it is currently the kernel doesn't need to know anything about the 
> > semantics of individual counters being selected, so it's currently convenient 
> > that we can aim to maintain all the counter meta data we have in userspace 
> > according to the changing needs of tools or drivers (e.g. names, descriptions, 
> > units, max values, normalization equations), de-coupled from the kernel, 
> > instead of splitting it between the kernel and userspace.
> 
> And that fully violates the design premise of perf. The kernel should be in 
> control of resources, not userspace.
> 
> If we'd have put userspace in charge, we could now not profile the same task by 
> two (or more) different observers. But with kernel side counter management that 
> is no problem at all.

Beyond that there are a couple of other advantages of managing performance metrics 
via the kernel:

2)

Per task/workload/user separation of events, for example you can do:

	perf record make bzImage
	perf stat make bzImage

These will only profile/count in the child task hierarchy spawned by the shell. 
Multiple such sessions can be started and the perf sampling/counting of the 
separate workloads is independent of each other.

3)

Fine-grained security model: you can choose which events to expose to what privilege
levels. You can also work around hardware bugs without complicating tooling.

4)

Seamless integration of the various PMU and other event metrics with each other: 
you can start a CPU PMU event, a trace event, a software event and an uncore 
event, which all output into the same coherent, ordered ring-buffer, timestamped 
and correlated consistently, etc.

5)

The unified event model on the kernel side also brings with itself 100 KLOC of 
rich, unified tooling in tools/perf/, which tries to keep it all together and 
functional.

================

The cost of all that is having to implement kernel support for it, but once that 
initial hurdle is taken, it's well worth it! A PMU driver can be as simple as 
offering only a single system-wide counter.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
  2015-05-27 15:39         ` Peter Zijlstra
  2015-05-27 16:41           ` Ingo Molnar
@ 2015-06-04 18:53           ` Robert Bragg
  1 sibling, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-06-04 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, Daniel Vetter, Jani Nikula, David Airlie,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, dri-devel, linux-api

On Wed, May 27, 2015 at 4:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
>> >
>> > So for me the 'natural' way to represent this in perf would be through
>> > event groups. Create a perf event for every single event -- yes this is
>> > 53 events.
>>
>> So when I was first looking at this work I had considered the
>> possibility of separate events, and these are some of the things that
>> in the end made me forward the hardware's raw report data via a single
>> event instead...
>>
>> There are 100s of possible B counters depending on the MUX
>> configuration + fixed-function logic in addition to the A counters.
>
> There's only 8 B counters, what there is 100s of is configurations, and
> that's no different from any other PMU. There's 100s of events to select
> on the CPU side as well - yet only 4 (general purpose) counters (8 if
> you disable HT on recent machines) per CPU.

To clarify one thing here; In the PRM you'll see a reference to
'Reserved' slots in the report layouts and these effectively
correspond to a further 8 configurable counters. These further
counters get referred to as 'C' counters, for 'custom'; similar to
the B counters they are defined based on the MUX configuration, except
they skip the boolean logic pipeline.

Thanks for the comparison with the CPU.

My main thought here is to beware of considering these configurable
counters as 'general purpose' if that might also imply orthogonality.
All our configurable counters build on top of a common/shared MUX
configuration so I'd tend to think of them more like a 'general
purpose counter set'.

>
> Of course, you could create more than 8 events at any one time, and then
> perf will round-robin the configuration for you.

This hardware really isn't designed to allow round-robin
reconfiguration. If we change the metric set then we will lose the
aggregate value of our B counters. Besides writing the large MUX
config + boolean logic state we would also need to save the current
aggregated counter values only maintained by the OA unit so they could
be restored later as well as any internal state that's simply not
accessible to us. If we reconfigure to try and round-robin between
sets of counters each reconfiguration will trash state that we have no
way of restoring. The counters effectively have to be out of use
before we reconfigure, since there's no going back afterwards.

>
>> A
>> choice would need to be made about whether to expose events for the
>> configurable counters that aren't inherently associated with any
>> semantics, or instead defining events for counters with specific
>> semantics (with 100s of possible counters to define). The latter would
>> seem more complex for userspace and the kernel if they both now have
>> to understand the constraints on what counters can be used together.
>
> Again, no different from existing PMUs. The most 'interesting' part of
> any PMU driver is event scheduling.
>
> The explicit design of perf was to put event scheduling _in_ the kernel
> and not allow userspace direct access to the hardware (as previous PMU
> models did).

I'm thinking your 'event scheduling' here refers to the kernel being
able to handle multiple concurrent users of the same pmu, but then we
may be talking at cross purposes...

I was describing how (if we exposed events for counters with well
defined semantics - as opposed to just 16 events for the B/C counters)
the driver would need to do some kind of matching based on the events
added to a group to deduce which MUX configuration should be used and
saying that in this case userspace would need to be very aware of
which events belong to a specific MUX configuration.

I don't think we have the hardware flexibility to support scheduling
in terms of multiple concurrent users, unless they happen to want
exactly the same configuration.

I also don't think we have a use case for accessing gpu metrics where
concurrent access is really important. It's possible to think of cases
where it might be nice, except they are unlikely to involve the same
configuration, so given the current OA unit design I haven't had plans
to try and support concurrent access and haven't had any requests for
it so far.

>
>> I
>> guess with either approach we would also need to have some form of
>> dedicated group leader event accepting attributes for configuring the
>> state that affects the group as a whole, such as the counter
>> configuration (3D vs GPGPU vs media etc).
>
> That depends a bit on how flexible you want to switch between these
> modes; with the cost being in the 100s of MMIO register writes, it might
> just not make sense to make this too dynamic.
>
> An option might be to select the mode through the PMU driver's sysfs
> files. Another option might be to expose the B counters 3 times and have
> each set be mutually exclusive. That is, as soon as you've created a 3D
> event, you can no longer create GPGPU/Media events.

Hmm, interesting idea to have global/pmu state be controlled via sysfs...

I suppose the security model doesn't seem as clear, with sysfs access
gated by file permissions while we want to gate some control based on
whether you have access to a gpu context. The risk of multiple users
looking to open a group of events and first setting a global mode via
sysfs seems like it could also silently misconfigure your events with
the wrong mode. Maybe it could become immutable once one event is
opened and you could double check afterwards.

Besides the mode/MUX config choice, other state that is shared by the
group includes:
- single context vs system wide profiling
- the timer exponent

Using sysfs I think we'd have to be careful to not expose
configuration options that relate to the security policy if there's a
risk of a non-privileged process affecting it.

>
>> I'm not sure where we would
>> handle the context-id + drm file descriptor attributes for initiating
>> single context profiling but guess we'd need to authenticate each
>> individual event open.
>
> Event open is a slow path, so that should not be a problem, right?

Yeah I don't think I'd be concerned about duplicating the checks
per-event from a performance pov.

I'm more concerned about differentiating security checks that relate
to the final configuration as a whole vs individual events. In this
case we use a drm file descriptor + context handle to allow profiling
a single gpu context that the current process has access to, but the
choice of profiling a single context or across all contexts affects
the final hardware configuration too which relates to the group. Even
if a process has the privileges to additionally open an event for
system wide metrics, this choice is part of the OA unit configuration
and incompatible with single-context filtering. For reference, on
Haswell the reports don't include a context identifier which makes it
difficult to try and emulate a single-context filtering event with a
system wide hardware configuration.

I can see that we'd be able to cross-check event compatibility for
each addition to the group but I also can't help but see this as an
example of how tightly coupled the configuration of these counters
are.

>
>> It's not clear if we'd configure the report
>> layout via the group leader, or try to automatically choose the most
>> compact format based on the group members. I'm not sure how pmus
>> currently handle the opening of enabled events on an enabled group but
>
> With the PERF_FORMAT_GROUP layout changing in-flight. I would recommend
> not doing that -- decoding the output will be 'interesting' but not
> impossible.

Right, I wouldn't want to have to handle such corner cases, so I think
I'd want to report an error back to userspace on any attempt to extend
a group that has ever been activated.

>
>> I think there would need to be limitations in our case that new
>> members can't result in a reconfigure of the counters if that might
>> lose the current counter values known to userspace.
>
> I'm not entirely sure what you mean here.

This is referring to the issue of us not being able to explicitly save
and restore the state of the OA unit to be able to allow reconfiguring
the counters while they are in use.

The counter values for B/C counters are only maintained by the OA unit
and in the case of B counters which supports e.g. referencing delayed
values, there is state held in the unit that we have no way to
explicitly access and save.

>
>> From a user's pov, there's no real freedom to mix and match which
>> counters are configured together, and there's only some limited
>> ability to ignore some of the currently selected counters by not
>> including them in reports.
>
> It is 'impossible' to create a group that is not programmable. That is,
> the pmu::event_init() call _should_ verify that the addition of the
> event to the group (as given by event->group_leader) is valid.
>
>> Something to understand here is that we have to work with sets of
>> pre-defined MUX + fixed-function logic configurations that have been
>> validated to give useful metrics for specific use cases, such as
>> benchmarking 3D rendering, GPGPU or media workloads.
>
> This is fine; as stated above. Since these are limited pieces of
> 'firmware' which are expensive to load, you don't have to do a fully
> dynamic solution here.
>
>> As it is currently the kernel doesn't need to know anything about the
>> semantics of individual counters being selected, so it's currently
>> convenient that we can aim to maintain all the counter meta data we
>> have in userspace according to the changing needs of tools or drivers
>> (e.g. names, descriptions, units, max values, normalization
>> equations), de-coupled from the kernel, instead of splitting it
>> between the kernel and userspace.
>
> And that fully violates the design premise of perf. The kernel should be
> in control of resources, not userspace.

This isn't referring to programmable resources though; it's only
talking about where the higher-level metadata about these counters is
maintained, including names and descriptions detailing the specific
counter semantics, as well as a description of the
equations/expressions that should be used to normalize these counters
to be useful to users.

This seems very comparable to other PMUs that expose _RAW, hardware
specific data via samples that rely on a userspace library like
libpfm4 to understand.

I see that different tools have slightly different needs and may want
to normalize some counters differently. This data is still evolving
over time, and overall I feel it's going to be more
practical/maintainable for us to keep this semantic data consolidated
in userspace.

For reference, what I'm really referring to here is an established XML
description of OA counters maintained and validated by engineers
closely involved with the hardware. I think it makes sense to factor
into this trade-off the benefit we get from being able to easily
leverage this data, but its relative lack of stability makes me prefer
to limit ourselves to just putting the MUX and boolean logic
configurations in the kernel.

>
> If we'd have put userspace in charge, we could now not profile the same
> task by two (or more) different observers. But with kernel side counter
> management that is no problem at all.

Sorry, my comment must have come across wrong, because I wasn't
talking about a choice that affects who is responsible for configuring
the counters, or about supporting multiple observers.

I was talking about a trade-off for where we maintain the higher level
information about the counters that can be exposed, especially the
code for normalizing individual counters.

>
>> A benefit of being able to change the report size is to reduce memory
>> bandwidth usage that can skew measurements. It's possible to request
>> the gpu to write out periodic snapshots at a very high frequency (we
>> can program a period as low as 160 nanoseconds) and higher frequencies
>> can start to expose some interesting details about how the gpu is
>> utilized - though with notable observer effects too. How careful we
>> are to not waste bandwidth is expected to determine what sampling
>> resolutions we can achieve before significantly impacting what we are
>> measuring.
>>
>> Splitting the counters up looked like it could increase the bandwidth
>> we use quite a bit. The main difference comes from requiring 64bit
>> values instead of the 32bit values in our raw reports. This can be
>> offset partly since there are quite a few 'reserved'/redundant A
>> counters that don't need forwarding. As an example in the most extreme
>> case, instead of an 8 byte perf_event_header + 4byte raw_size + 256
>> byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
>> A counters (ignoring redundant ones) + 16 configurable counters = 400
>
> In that PDF there's only 8 configurable 'B' counters.

Sorry, that's another example of the documentation being incomplete.
The current document only says enough about the B counters to show how
to access the gpu clock that is required to normalize many of the A
counters. There are also the 8 C counters I mentioned earlier, derived
from the MUX configuration but without the boolean logic.

>
>> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
>> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
>> could choose to forward only 2 or 3 counters of interest at these high
>> frequencies which isn't possible currently.
>
> Right; although you'd have to spend some cpu cycles on the format shift.
> Which might make it moot again. That said; I would really prefer we
> start out with trying to make the generic format stuff work before
> trying to come up with special case hacks.
>
>> > Use the MMIO reads for the regular read() interface, and use a hrtimer
>> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
>> > covering the all events in the current group, for sampling.
>>
>> Unfortunately due to the mmio limitations and the need to relate
>> counters I can't imagine many use cases for directly accessing the
>> counters individually via the read() interface.
>
> Fair enough.
>
>> MI_REPORT_PERF_COUNT commands are really only intended for collecting
>> reports in sync with a command stream. We are experimenting currently
>> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
>> commands automatically around the batches of commands submitted by
>> userspace so we can do a better job of filtering metrics across many
>> gpu contexts, but for now the expectation is that the kernel shouldn't
>> be emitting MI_REPORT_PERF_COUNT commands. We emit
>> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
>> GL_INTEL_performance_query extension, at the start and end of a query
>> around a sequence of commands that the application is interested in
>> measuring.
>
> I will try and read up on the GL_INTEL_performance_query thing.

You can see my code for this here, in case that's helpful:
https://github.com/rib/mesa/tree/wip/rib/oa-hsw-4.0.0

You might also be interested in my more recent 'codegen' branch too
(work in progress though):
https://github.com/rib/mesa/tree/wip/rib/oa-hsw-codegen

Here you can see an example of the meta data I mentioned earlier, e.g.
describing the equations for normalizing the counters for one metrics
set:

https://github.com/rib/mesa/blob/wip/rib/oa-hsw-codegen/src/mesa/drivers/dri/i965/brw_oa_hsw.xml

This set corresponds to the "3D" metric set enabled in the i915_oa pmu
driver.

More example data can also be seen in my gputop tool, with 'render
basic', 'compute basic', 'compute extended', 'memory reads', 'memory
writes' and 'sampler balance' metric sets:

https://github.com/rib/gputop/blob/641d7644df64a871f36f3c283cfa18a2f1530813/gputop/oa-hsw.xml

>
>> > You can use the perf_event_attr::config to select the counter (A0-A44,
>> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
>> > corresponding CEC registers.
>> >
>>
>> Hopefully covered above, but since the fixed-function state is so
>> dependent on the MUX configuration I think it currently makes sense to
>> treat the MUX plus logic state (including the CEC state) as a
>> tightly coupled unit.
>
> Oh wait, the A counters are also affected by the MUX programming!
> That wasn't clear to me before this.

Er, actually you were right before: the raw A counters aren't directly
affected by the MUX configuration. In this paragraph 'fixed-function
state' was referring to the boolean logic state for the B counters, in
case that was unclear, and the CEC ('custom event counter') registers
are part of that boolean logic state.

That said though, the raw A counters aren't really /useful/ without
the MUX programming because most of them are so closely related to the
gpu clock (which we access via a B or C counter) that we have to
program the MUX before we can normalize the A counters to be
meaningful to users.

>
> Same difference though; you could still do either the sysfs aided flips
> or the mutually exclusive counter types to deal with this; all still
> assuming reprogramming all that is 'expensive'.
>
>> The Flexible EU Counters for Broadwell+ could be more amenable to this
>> kind of independent configuration, as I don't believe they are
>> dependent on the MUX configuration.
>
> That sounds good :-)
>
>> One idea that's come up a lot though is having the possibility of
>> being able to configure an event with a full MUX + fixed-function
>> state description.
>
> Right; seeing how there's only a very limited number of these MUX
> programs validated (3D, GPGPU, Media, any others?) this should be
> doable. And as mentioned before, you could disable the other types once
> you create one in order to limit the reprogramming thing.

Here I meant being able to support userspace supplying a full MUX
configuration and boolean logic configuration (i.e. large arrays of
data), so that userspace tools can experiment with new configurations
during the early stages of building and validating them. This is
mostly interesting to the engineers responsible for defining and
testing these configurations in the first place, since it means they
can share experimental configurations with tools, with the caveat that
you'd likely need root privileges to use the interface. It's appealing
to have a low barrier for enabling people to test new configurations,
without the need for users to build a custom kernel.

I'm not worried about supporting this in the short term, but it's
still a use case that could be helpful to support eventually, and one
that I want to keep in mind at least.

About the number of different configurations, I've been using 3D,
gpgpu and media as summary examples, but there are quite a few
configurations really...

On Haswell these are names of some of the sets I'm aware of...
Render Metrics Basic Gen7.5
Compute Metrics Basic Gen7.5
Compute Metrics Extended Gen7.5
Render Metrics Slice Balance Gen7.5
Memory Reads Distribution Gen7.5
Memory Writes Distribution Gen7.5
Metric set SamplerBalance
Memory Reads on Write Port Distribution Gen7.5
Stencil PMA Hold Metrics Gen7.5
Media Memory Reads Distribution Gen7.5
Media Memory Writes Distribution Gen7.5
Media VME Pipe Gen7.5

For Broadwell we have more; for example...
Render Metrics Basic Gen8
Compute Metrics Basic Gen8
Render Metrics for 3D Pipeline Profile
Memory Reads Distribution Gen8
Memory Writes Distribution Gen8
Compute Metrics Extended Gen8
Compute Metrics L3 Cache Gen8
Data Port Reads Coalescing Gen8
Data Port Writes Coalescing Gen8
Metric set HDCAndSF
Metric set L3_1
Metric set L3_2
Metric set L3_3
Metric set L3_4
Metric set RasterizerAndPixelBackend
Metric set Sampler_1
Metric set Sampler_2
Metric set TDL_1
Metric set TDL_2
Stencil PMA Hold Metrics Gen8
Media Memory Reads Distribution Gen8
Media Memory Writes Distribution Gen8
Media VME Pipe Gen8
HDC URB Coalescing Metrics Gen8

I'm not sure that it will end up making sense to publish all of these;
it depends on what kind of testing they have been through to validate
that they give helpful data. On the other hand these sets are evolving
and there may be more over time, but hopefully this at least gives a
ballpark idea of the number of configurations that could be
interesting to enable between Haswell and Broadwell.

>
>> >
>> > This does not require random per driver ABI extentions for
>> > perf_event_attr, nor your custom output format.
>> >
>> > Am I missing something obvious here?
>>
>> Definitely nothing 'obvious' since the current documentation is
>> notably incomplete a.t.m, but I don't think we were on the same page
>> about how the hardware works and our use cases.
>>
>> Hopefully some of my above comments help clarify some details.
>
> Yes, thanks!
>

Ok, I think the clarifications about the nature of our OA hardware
that we're drilling into here might also help explain more of the
trade-offs that were considered when deciding to forward raw OA
hardware reports.

I wonder if it would be good for us to take a step back here to
reconsider the implications of enabling a pmu that really doesn't
relate to the cpu or to tasks running on cpus - since I think this is
the first example of that, and it's also had a bearing on how we
currently forward samples via perf.

I think the basic implications to consider are that for our use cases
we wouldn't expect to be requesting cpu centric info in samples, such
as _IP, _CALLCHAIN, _CPU, _BRANCH_STACK and _REGS/STACK_USER, and that
we wouldn't open events for specific pids/tasks.

I don't think we'd expect to use existing userspace tools with this
pmu, such as perf, and instead:

- We're accessing perf from within OpenGL for implementing a
performance query extension that applications/games/tools can then use
to get metrics for a single gpu context owned by OpenGL.
- We're creating new tools that are geared for viewing gpu metrics,
including gputop and grafips

I have a feeling that Ingo and yourself may currently be thinking of
the OA unit like an uncore pmu with the idea that tools like perf
should Just Work™ with OA counters, while I don't think that's going
to be the case.

Even so, I still feel that we benefit from being able to leverage the
perf infrastructure to support our use cases, and even though perf is
currently quite cpu centric I've been happy that it's turned out to be
a good fit for exposing device metrics too; ignoring/bypassing some of
the cpu specific bits for our needs has been straightforward.

I don't bring this up as a tangent to the question of having one
raw-report event vs multiple single-counter events, but rather to
highlight that our gpu focused use cases might be a bit different to
what you're used to. I hope our tooling and OpenGL work can inform
this discussion too, representing the practical ends that we're most
eager to support.

Regards,
- Robert

Thread overview: 31+ messages
2015-05-07 14:15 [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 01/11] perf: export perf_event_overflow Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 02/11] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 03/11] perf: Add PERF_EVENT_IOC_FLUSH ioctl Robert Bragg
2015-05-07 14:20   ` [Intel-gfx] " Chris Wilson
2015-05-18 17:25     ` [RFC PATCH v2] " Robert Bragg
2015-05-20 12:12       ` Ingo Molnar
2015-05-21 17:40         ` [RFC PATCH] perf: enable fsync to flush buffered samples Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 04/11] perf: Add a PERF_RECORD_DEVICE event type Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 05/11] perf: allow drivers more control over event logging Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 06/11] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 07/11] drm/i915: Expose PMU for Observation Architecture Robert Bragg
2015-05-07 14:36   ` [Intel-gfx] " Chris Wilson
2015-05-07 14:58   ` Chris Wilson
2015-05-18 16:36     ` Robert Bragg
2015-05-18 17:17       ` [RFC PATCH v2] " Robert Bragg
2015-05-18 17:21       ` [RFC PATCH] squash: be more careful stopping oacontrol updates Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 08/11] drm/i915: add OA config for 3D render counters Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 09/11] drm/i915: Add dev.i915.oa_event_paranoid sysctl option Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 10/11] drm/i915: report OA buf overrun + report lost status Robert Bragg
2015-05-07 14:15 ` [RFC PATCH 11/11] WIP: drm/i915: constrain unit gating while using OA Robert Bragg
2015-05-08 16:21 ` [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU Peter Zijlstra
2015-05-18 17:29   ` Robert Bragg
2015-05-08 16:24 ` Peter Zijlstra
2015-05-15  1:07   ` Robert Bragg
2015-05-19 14:53     ` Peter Zijlstra
2015-05-20 23:17       ` Robert Bragg
2015-05-21  8:24         ` [Intel-gfx] " Daniel Vetter
2015-05-27 15:39         ` Peter Zijlstra
2015-05-27 16:41           ` Ingo Molnar
2015-06-04 18:53           ` [Intel-gfx] " Robert Bragg
