* [PATCH 0/9] Enable Gen 7 Observation Architecture
@ 2016-04-20 14:23 Robert Bragg
  2016-04-20 14:23 ` [PATCH 1/9] drm/i915: Add i915 perf infrastructure Robert Bragg
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

I've been working on some i-g-t tests for this new interface and, while I still
have more tests to write, it seemed worth sending out another updated series in
the meantime.


Firstly this includes updates based on Emil's previous comments.

Then there have been a few issues hit while writing tests:

* It seems it can take a fairly long time for the MUX config to apply after
  the register writes have finished, and now the driver inserts a delay after
  configuration, before enabling periodic sampling.
* I've found that sometimes the HW tail pointer can get ahead of OA buffer
  writes, especially with higher sampling frequencies, and so now the driver
  maintains a margin behind the HW tail pointer to ensure the most recent
  reports have some time to land before attempting to copy them to userspace.
* As a sanity check that a report is valid before it's copied to userspace, the
  driver checks that the report-id field != 0.
* The _BUFFER_OVERFLOW record has been replaced with a _BUFFER_LOST record with
  more specific semantics.
* Since we can't clear the overflow status on Haswell while periodic sampling
  is enabled, if an overflow occurs we now restart the unit.
* The maximum OA periodic sampling exponent is now 31
* We verify the head/tail pointers read back from HW look sane before using them
  as offsets to read reports from the OA buffer. We reset the HW if they look bad.
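
As a rough illustration of the tail margin, report-id check and pointer sanity
check listed above, here is a minimal standalone sketch. It is not code from
this series: every sketch_* name is hypothetical, and the buffer size, report
size and margin are assumed values chosen only to make the arithmetic concrete.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative assumptions only; none of these values come from the series. */
#define SKETCH_OA_BUFFER_SIZE (16u * 1024 * 1024) /* assumed power-of-two ring */
#define SKETCH_REPORT_SIZE    256u                /* assumed bytes per report */
#define SKETCH_TAIL_MARGIN    (4u * SKETCH_REPORT_SIZE) /* hypothetical margin */

/* An offset is usable only if it lies inside the ring and is report aligned. */
static bool sketch_oa_offset_is_sane(uint32_t offset)
{
	return offset < SKETCH_OA_BUFFER_SIZE &&
	       (offset % SKETCH_REPORT_SIZE) == 0;
}

/* Bytes between head and tail, allowing for wrap-around of the ring. */
static uint32_t sketch_oa_bytes_available(uint32_t head, uint32_t tail)
{
	return (tail - head) & (SKETCH_OA_BUFFER_SIZE - 1);
}

/* Hold back a margin behind the HW tail so recent reports have time to land. */
static uint32_t sketch_oa_bytes_safe_to_read(uint32_t head, uint32_t hw_tail)
{
	uint32_t avail = sketch_oa_bytes_available(head, hw_tail);

	return avail > SKETCH_TAIL_MARGIN ? avail - SKETCH_TAIL_MARGIN : 0;
}

/* Assumes the report id is the first dword; zero means "not written yet". */
static bool sketch_oa_report_is_valid(const uint32_t *report)
{
	return report[0] != 0;
}
```

In this sketch a caller would first reject insane head/tail values (the point
at which the driver described above resets the HW), then copy at most
sketch_oa_bytes_safe_to_read() bytes, skipping any report that fails the id
check.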


For reference the work-in-progress i-g-t tests can be found here:

https://github.com/rib/intel-gpu-tools
branch = wip/rib/i915-perf-tests

or browsed here:
https://github.com/rib/intel-gpu-tools/commits/wip/rib/i915-perf-tests

Also for reference these patches can be fetched from here:

https://github.com/rib/linux
branch = wip/rib/oa-2016-04-19-nightly

Regards,
- Robert


Robert Bragg (9):
  drm/i915: Add i915 perf infrastructure
  drm/i915: rename OACONTROL GEN7_OACONTROL
  drm/i915: don't whitelist oacontrol in cmd parser
  drm/i915: Add 'render basic' Haswell OA unit config
  drm/i915: Enable i915 perf stream for Haswell OA unit
  drm/i915: advertise available metrics via sysfs
  drm/i915: Add dev.i915.perf_event_paranoid sysctl option
  drm/i915: add oa_event_min_timer_exponent sysctl
  drm/i915: Add more Haswell OA metric sets

 drivers/gpu/drm/i915/Makefile           |    4 +
 drivers/gpu/drm/i915/i915_cmd_parser.c  |   33 +-
 drivers/gpu/drm/i915/i915_dma.c         |    8 +
 drivers/gpu/drm/i915/i915_drv.h         |  155 ++++
 drivers/gpu/drm/i915/i915_gem_context.c |   24 +-
 drivers/gpu/drm/i915/i915_oa_hsw.c      |  658 ++++++++++++++
 drivers/gpu/drm/i915/i915_oa_hsw.h      |   38 +
 drivers/gpu/drm/i915/i915_perf.c        | 1439 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  340 +++++++-
 include/uapi/drm/i915_drm.h             |  133 +++
 10 files changed, 2795 insertions(+), 37 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.c
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.h
 create mode 100644 drivers/gpu/drm/i915/i915_perf.c

-- 
2.7.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH 1/9] drm/i915: Add i915 perf infrastructure
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 22:41   ` Chris Wilson
  2016-04-20 14:23 ` [PATCH 2/9] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Sourab Gupta, Deepak S, Daniel Vetter, Robert Bragg

Adds base i915 perf infrastructure for Gen performance metrics.

This adds a DRM_IOCTL_I915_PERF_OPEN ioctl that takes an array of uint64
properties to configure a stream of metrics and returns a new fd usable
with standard VFS system calls including read() to read typed and sized
records; ioctl() to enable or disable capture and poll() to wait for
data.

A stream is opened something like:

  uint64_t properties[] = {
      /* Single context sampling */
      DRM_I915_PERF_PROP_CTX_HANDLE,        ctx_handle,

      /* Include OA reports in samples */
      DRM_I915_PERF_PROP_SAMPLE_OA,         true,

      /* OA unit configuration */
      DRM_I915_PERF_PROP_OA_METRICS_SET,    metrics_set_id,
      DRM_I915_PERF_PROP_OA_FORMAT,         report_format,
      DRM_I915_PERF_PROP_OA_EXPONENT,       period_exponent,
   };
   struct drm_i915_perf_open_param param = {
      .flags = I915_PERF_FLAG_FD_CLOEXEC |
               I915_PERF_FLAG_FD_NONBLOCK |
               I915_PERF_FLAG_DISABLED,
      .properties_ptr = (uint64_t)properties,
      .num_properties = sizeof(properties) / 16,
   };
   int fd = drmIoctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
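
The properties array is a flat list of u64 (id, value) pairs, which is why
num_properties above is sizeof(properties) / 16. The following standalone
sketch walks such an array with the same kind of per-pair checks the patch
applies (reject an empty list, reject id 0 or an id past the last known one).
It is only an illustration: the sketch_* names are hypothetical and it reads
from an ordinary array rather than from a userspace pointer.

```c
#include <errno.h>
#include <stdint.h>

enum sketch_prop_id {
	SKETCH_PROP_CTX_HANDLE = 1, /* mirrors DRM_I915_PERF_PROP_CTX_HANDLE */
	SKETCH_PROP_MAX             /* one past the last known id */
};

struct sketch_open_props {
	int single_context;
	uint64_t ctx_handle;
};

/* Parse n_pairs (id, value) pairs; returns 0 or a negative errno value. */
static int sketch_parse_props(const uint64_t *pairs, uint32_t n_pairs,
			      struct sketch_open_props *out)
{
	uint32_t i;

	out->single_context = 0;
	out->ctx_handle = 0;

	if (n_pairs == 0)
		return -EINVAL;

	for (i = 0; i < n_pairs; i++) {
		uint64_t id = pairs[2 * i];        /* even slots hold ids */
		uint64_t value = pairs[2 * i + 1]; /* odd slots hold values */

		if (id == 0 || id >= SKETCH_PROP_MAX)
			return -EINVAL;

		if (id == SKETCH_PROP_CTX_HANDLE) {
			out->single_context = 1;
			out->ctx_handle = value;
		}
	}

	return 0;
}
```

Like the patch, this validates each property in isolation and does not check
whether the combination of properties makes sense.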

Records read all start with a common { type, size } header with
DRM_I915_PERF_RECORD_SAMPLE being of most interest. Sample records
contain an extensible number of fields and it's the
DRM_I915_PERF_PROP_SAMPLE_xyz properties given when opening that
determine what's included in every sample.
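
To make the { type, size } framing concrete, here is a hedged sketch of how
userspace might walk the bytes returned by one read(). The header struct is a
local mirror of the layout proposed in this patch, the sketch_* names are
hypothetical, and it assumes that size counts the header plus its payload,
which the patch text does not spell out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct drm_i915_perf_record_header from this patch. */
struct sketch_record_header {
	uint32_t type;
	uint16_t pad;
	uint16_t size; /* assumed to include the header itself */
};

#define SKETCH_RECORD_SAMPLE 1 /* mirrors DRM_I915_PERF_RECORD_SAMPLE */

/* Count sample records in len bytes of data; returns -1 on malformed input. */
static int sketch_count_samples(const uint8_t *buf, size_t len)
{
	size_t offset = 0;
	int samples = 0;

	while (offset < len) {
		struct sketch_record_header hdr;

		/* A partial header means the buffer was truncated. */
		if (len - offset < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + offset, sizeof(hdr));

		/* The record must hold its header and fit in what remains. */
		if (hdr.size < sizeof(hdr) || hdr.size > len - offset)
			return -1;

		if (hdr.type == SKETCH_RECORD_SAMPLE)
			samples++;

		/* Unknown record types are skipped using their size. */
		offset += hdr.size;
	}

	return samples;
}
```

Skipping by size rather than by a fixed per-type length is what lets a reader
ignore record types it does not understand.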

No specific streams are supported yet so any attempt to open a stream
will return an error.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/Makefile    |   3 +
 drivers/gpu/drm/i915/i915_dma.c  |   8 +
 drivers/gpu/drm/i915/i915_drv.h  |  86 ++++++++
 drivers/gpu/drm/i915/i915_perf.c | 443 +++++++++++++++++++++++++++++++++++++++
 include/uapi/drm/i915_drm.h      |  67 ++++++
 5 files changed, 607 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/i915_perf.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 0b88ba0..2f7ef71 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -98,6 +98,9 @@ i915-y += dvo_ch7017.o \
 # virtual gpu code
 i915-y += i915_vgpu.o
 
+# perf code
+i915-y += i915_perf.o
+
 # legacy horrors
 i915-y += i915_dma.o
 
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index b377753..b62e269 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1039,6 +1039,11 @@ static int i915_driver_init_early(struct drm_i915_private *dev_priv,
 	if (ret < 0)
 		return ret;
 
+	/* Must at least be initialized before trying to pin any context
+	 * which i915_perf hooks into.
+	 */
+	i915_perf_init(dev);
+
 	/* This must be called before any calls to HAS_PCH_* */
 	intel_detect_pch(dev);
 
@@ -1425,6 +1430,8 @@ int i915_driver_unload(struct drm_device *dev)
 
 	i915_driver_unregister(dev_priv);
 
+	i915_perf_fini(dev);
+
 	drm_vblank_cleanup(dev);
 
 	intel_modeset_cleanup(dev);
@@ -1579,6 +1586,7 @@ const struct drm_ioctl_desc i915_ioctls[] = {
 	DRM_IOCTL_DEF_DRV(I915_GEM_USERPTR, i915_gem_userptr_ioctl, DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_GETPARAM, i915_gem_context_getparam_ioctl, DRM_RENDER_ALLOW),
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_SETPARAM, i915_gem_context_setparam_ioctl, DRM_RENDER_ALLOW),
+	DRM_IOCTL_DEF_DRV(I915_PERF_OPEN, i915_perf_open_ioctl, DRM_RENDER_ALLOW),
 };
 
 int i915_max_ioctl = ARRAY_SIZE(i915_ioctls);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 85102ad..5a2a4d6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1708,6 +1708,79 @@ struct intel_wm_config {
 	bool sprites_scaled;
 };
 
+struct i915_perf_read_state {
+	int count;
+	ssize_t read;
+	char __user *buf;
+};
+
+struct i915_perf_stream {
+	struct drm_i915_private *dev_priv;
+
+	struct list_head link;
+
+	u32 sample_flags;
+
+	struct intel_context *ctx;
+	bool enabled;
+
+	/* Enables the collection of HW samples, either in response to
+	 * I915_PERF_IOCTL_ENABLE or implicitly called when stream is
+	 * opened without I915_PERF_FLAG_DISABLED.
+	 */
+	void (*enable)(struct i915_perf_stream *stream);
+
+	/* Disables the collection of HW samples, either in response to
+	 * I915_PERF_IOCTL_DISABLE or implicitly called before
+	 * destroying the stream.
+	 */
+	void (*disable)(struct i915_perf_stream *stream);
+
+	/* Return: true if any i915 perf records are ready to read()
+	 * for this stream.
+	 */
+	bool (*can_read)(struct i915_perf_stream *stream);
+
+	/* Call poll_wait, passing a wait queue that will be woken
+	 * once there is something ready to read() for the stream
+	 */
+	void (*poll_wait)(struct i915_perf_stream *stream,
+			  struct file *file,
+			  poll_table *wait);
+
+	/* For handling a blocking read, wait until there is something
+	 * ready to read() for the stream. E.g. wait on the same
+	 * wait queue that would be passed to poll_wait() until
+	 * ->can_read() returns true (if it's safe to call ->can_read()
+	 * without the i915 perf lock held).
+	 */
+	int (*wait_unlocked)(struct i915_perf_stream *stream);
+
+	/* Copy as many buffered i915 perf samples and records for
+	 * this stream to userspace as will fit in the given buffer.
+	 *
+	 * Only write complete records.
+	 *
+	 * read_state->count is the length of read_state->buf
+	 *
+	 * Update read_state->read with the number of bytes written.
+	 *
+	 * Return any error condition that results in a short read
+	 * such as -ENOSPC or -EFAULT, even though these may be
+	 * squashed to 0 before returning to userspace (if at least
+	 * one record was successfully copied, as determined via
+	 * the read_state->read length)
+	 */
+	int (*read)(struct i915_perf_stream *stream,
+		    struct i915_perf_read_state *read_state);
+
+	/* Cleanup any stream specific resources.
+	 *
+	 * The stream will always be disabled before this is called.
+	 */
+	void (*destroy)(struct i915_perf_stream *stream);
+};
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
@@ -1979,6 +2052,12 @@ struct drm_i915_private {
 
 	struct i915_runtime_pm pm;
 
+	struct {
+		bool initialized;
+		struct mutex lock;
+		struct list_head streams;
+	} perf;
+
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
 	struct {
 		int (*execbuf_submit)(struct i915_execbuffer_params *params,
@@ -3334,6 +3413,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file_priv);
 
+int i915_perf_open_ioctl(struct drm_device *dev, void *data,
+			 struct drm_file *file);
+
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
 					  struct i915_address_space *vm,
@@ -3450,6 +3532,10 @@ int i915_parse_cmds(struct intel_engine_cs *engine,
 		    u32 batch_len,
 		    bool is_master);
 
+/* i915_perf.c */
+extern void i915_perf_init(struct drm_device *dev);
+extern void i915_perf_fini(struct drm_device *dev);
+
 /* i915_suspend.c */
 extern int i915_save_state(struct drm_device *dev);
 extern int i915_restore_state(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
new file mode 100644
index 0000000..2143401
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -0,0 +1,443 @@
+/*
+ * Copyright © 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/sizes.h>
+
+#include "i915_drv.h"
+
+struct perf_open_properties {
+	u32 sample_flags;
+
+	u64 single_context:1;
+	u64 ctx_handle;
+};
+
+static ssize_t i915_perf_read_locked(struct i915_perf_stream *stream,
+				     struct file *file,
+				     char __user *buf,
+				     size_t count,
+				     loff_t *ppos)
+{
+	struct i915_perf_read_state state = { count, 0, buf };
+	int ret = stream->read(stream, &state);
+
+	if ((ret == -ENOSPC || ret == -EFAULT) && state.read)
+		ret = 0;
+
+	if (ret)
+		return ret;
+
+	return state.read ? state.read : -EAGAIN;
+}
+
+static ssize_t i915_perf_read(struct file *file,
+			      char __user *buf,
+			      size_t count,
+			      loff_t *ppos)
+{
+	struct i915_perf_stream *stream = file->private_data;
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	ssize_t ret;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		/* There's the small chance of false positives from
+		 * stream->wait_unlocked.
+		 *
+		 * E.g. with single context filtering since we only wait until
+		 * oabuffer has >= 1 report we don't immediately know whether
+		 * any reports really belong to the current context
+		 */
+		do {
+			ret = stream->wait_unlocked(stream);
+			if (ret)
+				return ret;
+
+			mutex_lock(&dev_priv->perf.lock);
+			ret = i915_perf_read_locked(stream, file,
+						    buf, count, ppos);
+			mutex_unlock(&dev_priv->perf.lock);
+		} while (ret == -EAGAIN);
+	} else {
+		mutex_lock(&dev_priv->perf.lock);
+		ret = i915_perf_read_locked(stream, file, buf, count, ppos);
+		mutex_unlock(&dev_priv->perf.lock);
+	}
+
+	return ret;
+}
+
+static unsigned int i915_perf_poll_locked(struct i915_perf_stream *stream,
+					  struct file *file,
+					  poll_table *wait)
+{
+	unsigned int streams = 0;
+
+	stream->poll_wait(stream, file, wait);
+
+	if (stream->can_read(stream))
+		streams |= POLLIN;
+
+	return streams;
+}
+
+static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
+{
+	struct i915_perf_stream *stream = file->private_data;
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int ret;
+
+	mutex_lock(&dev_priv->perf.lock);
+	ret = i915_perf_poll_locked(stream, file, wait);
+	mutex_unlock(&dev_priv->perf.lock);
+
+	return ret;
+}
+
+static void i915_perf_enable_locked(struct i915_perf_stream *stream)
+{
+	if (stream->enabled)
+		return;
+
+	/* Allow stream->enable() to refer to this */
+	stream->enabled = true;
+
+	if (stream->enable)
+		stream->enable(stream);
+}
+
+static void i915_perf_disable_locked(struct i915_perf_stream *stream)
+{
+	if (!stream->enabled)
+		return;
+
+	/* Allow stream->disable() to refer to this */
+	stream->enabled = false;
+
+	if (stream->disable)
+		stream->disable(stream);
+}
+
+static long i915_perf_ioctl_locked(struct i915_perf_stream *stream,
+				   unsigned int cmd,
+				   unsigned long arg)
+{
+	switch (cmd) {
+	case I915_PERF_IOCTL_ENABLE:
+		i915_perf_enable_locked(stream);
+		return 0;
+	case I915_PERF_IOCTL_DISABLE:
+		i915_perf_disable_locked(stream);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static long i915_perf_ioctl(struct file *file,
+			    unsigned int cmd,
+			    unsigned long arg)
+{
+	struct i915_perf_stream *stream = file->private_data;
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	long ret;
+
+	mutex_lock(&dev_priv->perf.lock);
+	ret = i915_perf_ioctl_locked(stream, cmd, arg);
+	mutex_unlock(&dev_priv->perf.lock);
+
+	return ret;
+}
+
+static void i915_perf_destroy_locked(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	if (stream->enabled)
+		i915_perf_disable_locked(stream);
+
+	if (stream->destroy)
+		stream->destroy(stream);
+
+	list_del(&stream->link);
+
+	if (stream->ctx) {
+		mutex_lock(&dev_priv->dev->struct_mutex);
+		i915_gem_context_unreference(stream->ctx);
+		mutex_unlock(&dev_priv->dev->struct_mutex);
+	}
+
+	kfree(stream);
+}
+
+static int i915_perf_release(struct inode *inode, struct file *file)
+{
+	struct i915_perf_stream *stream = file->private_data;
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	mutex_lock(&dev_priv->perf.lock);
+	i915_perf_destroy_locked(stream);
+	mutex_unlock(&dev_priv->perf.lock);
+
+	return 0;
+}
+
+
+static const struct file_operations fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	.release	= i915_perf_release,
+	.poll		= i915_perf_poll,
+	.read		= i915_perf_read,
+	.unlocked_ioctl	= i915_perf_ioctl,
+};
+
+static struct intel_context *
+lookup_context(struct drm_i915_private *dev_priv,
+	       struct file *user_filp,
+	       u32 ctx_user_handle)
+{
+	struct intel_context *ctx;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+	list_for_each_entry(ctx, &dev_priv->context_list, link) {
+		struct drm_file *drm_file;
+
+		if (!ctx->file_priv)
+			continue;
+
+		drm_file = ctx->file_priv->file;
+
+		if (user_filp->private_data == drm_file &&
+		    ctx->user_handle == ctx_user_handle) {
+			i915_gem_context_reference(ctx);
+			mutex_unlock(&dev_priv->dev->struct_mutex);
+
+			return ctx;
+		}
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	return NULL;
+}
+
+int i915_perf_open_ioctl_locked(struct drm_device *dev,
+				struct drm_i915_perf_open_param *param,
+				struct perf_open_properties *props,
+				struct drm_file *file)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_context *specific_ctx = NULL;
+	struct i915_perf_stream *stream = NULL;
+	unsigned long f_flags = 0;
+	int stream_fd;
+	int ret = 0;
+
+	if (props->single_context) {
+		u32 ctx_handle = props->ctx_handle;
+
+		specific_ctx = lookup_context(dev_priv, file->filp, ctx_handle);
+		if (!specific_ctx) {
+			DRM_ERROR("Failed to look up context with ID %u for opening perf stream\n",
+				  ctx_handle);
+			ret = -EINVAL;
+			goto err;
+		}
+	}
+
+	if (!specific_ctx && !capable(CAP_SYS_ADMIN)) {
+		DRM_ERROR("Insufficient privileges to open system-wide i915 perf stream\n");
+		ret = -EACCES;
+		goto err_ctx;
+	}
+
+	stream = kzalloc(sizeof(*stream), GFP_KERNEL);
+	if (!stream) {
+		ret = -ENOMEM;
+		goto err_ctx;
+	}
+
+	stream->sample_flags = props->sample_flags;
+	stream->dev_priv = dev_priv;
+	stream->ctx = specific_ctx;
+
+	/*
+	 * TODO: support sampling something
+	 *
+	 * For now this is as far as we can go.
+	 */
+	DRM_ERROR("Unsupported i915 perf stream configuration\n");
+	ret = -EINVAL;
+	goto err_alloc;
+
+	list_add(&stream->link, &dev_priv->perf.streams);
+
+	if (param->flags & I915_PERF_FLAG_FD_CLOEXEC)
+		f_flags |= O_CLOEXEC;
+	if (param->flags & I915_PERF_FLAG_FD_NONBLOCK)
+		f_flags |= O_NONBLOCK;
+
+	stream_fd = anon_inode_getfd("[i915_perf]", &fops, stream, f_flags);
+	if (stream_fd < 0) {
+		ret = stream_fd;
+		goto err_open;
+	}
+
+	if (!(param->flags & I915_PERF_FLAG_DISABLED))
+		i915_perf_enable_locked(stream);
+
+	return stream_fd;
+
+err_open:
+	list_del(&stream->link);
+	if (stream->destroy)
+		stream->destroy(stream);
+err_alloc:
+	kfree(stream);
+err_ctx:
+	if (specific_ctx) {
+		mutex_lock(&dev_priv->dev->struct_mutex);
+		i915_gem_context_unreference(specific_ctx);
+		mutex_unlock(&dev_priv->dev->struct_mutex);
+	}
+err:
+	return ret;
+}
+
+/* Note we copy the properties from userspace outside of the i915 perf
+ * mutex to avoid an awkward lockdep with mmap_sem.
+ *
+ * Note this function only validates properties in isolation; it doesn't
+ * validate that the combination of properties makes sense or that all
+ * properties necessary for a particular kind of stream have been set.
+ */
+static int read_properties_unlocked(struct drm_i915_private *dev_priv,
+				    u64 __user *uprops,
+				    u32 n_props,
+				    struct perf_open_properties *props)
+{
+	u64 __user *uprop = uprops;
+	int i;
+
+	memset(props, 0, sizeof(struct perf_open_properties));
+
+	if (!n_props) {
+		DRM_ERROR("No i915 perf properties given");
+		return -EINVAL;
+	}
+
+	if (n_props > DRM_I915_PERF_PROP_MAX) {
+		DRM_ERROR("More i915 perf properties specified than exist");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < n_props; i++) {
+		u64 id, value;
+		int ret;
+
+		ret = get_user(id, (u64 __user *)uprop);
+		if (ret)
+			return ret;
+
+		if (id == 0 || id >= DRM_I915_PERF_PROP_MAX) {
+			DRM_ERROR("Unknown i915 perf property ID");
+			return -EINVAL;
+		}
+
+		ret = get_user(value, (u64 __user *)uprop + 1);
+		if (ret)
+			return ret;
+
+		switch ((enum drm_i915_perf_property_id)id) {
+		case DRM_I915_PERF_PROP_CTX_HANDLE:
+			props->single_context = 1;
+			props->ctx_handle = value;
+			break;
+
+		case DRM_I915_PERF_PROP_MAX:
+			BUG();
+		}
+
+		uprop += 2;
+	}
+
+	return 0;
+}
+
+int i915_perf_open_ioctl(struct drm_device *dev, void *data,
+			 struct drm_file *file)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct drm_i915_perf_open_param *param = data;
+	struct perf_open_properties props;
+	u32 known_open_flags = 0;
+	int ret;
+
+	if (!dev_priv->perf.initialized) {
+		DRM_ERROR("i915 perf interface not available for this system");
+		return -ENOTSUPP;
+	}
+
+	known_open_flags = I915_PERF_FLAG_FD_CLOEXEC |
+			   I915_PERF_FLAG_FD_NONBLOCK |
+			   I915_PERF_FLAG_DISABLED;
+	if (param->flags & ~known_open_flags) {
+		DRM_ERROR("Unknown drm_i915_perf_open_param flag\n");
+		return -EINVAL;
+	}
+
+	ret = read_properties_unlocked(dev_priv,
+				       to_user_ptr(param->properties_ptr),
+				       param->num_properties,
+				       &props);
+	if (ret)
+		return ret;
+
+	mutex_lock(&dev_priv->perf.lock);
+	ret = i915_perf_open_ioctl_locked(dev, param, &props, file);
+	mutex_unlock(&dev_priv->perf.lock);
+
+	return ret;
+}
+
+void i915_perf_init(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = to_i915(dev);
+
+	INIT_LIST_HEAD(&dev_priv->perf.streams);
+	mutex_init(&dev_priv->perf.lock);
+
+	dev_priv->perf.initialized = true;
+}
+
+void i915_perf_fini(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = to_i915(dev);
+
+	if (!dev_priv->perf.initialized)
+		return;
+
+	/* Currently nothing to clean up */
+
+	dev_priv->perf.initialized = false;
+}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a5524cc..962cc96 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -230,6 +230,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_I915_GEM_USERPTR		0x33
 #define DRM_I915_GEM_CONTEXT_GETPARAM	0x34
 #define DRM_I915_GEM_CONTEXT_SETPARAM	0x35
+#define DRM_I915_PERF_OPEN		0x36
 
 #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
 #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
@@ -283,6 +284,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_IOCTL_I915_GEM_USERPTR			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_USERPTR, struct drm_i915_gem_userptr)
 #define DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_GETPARAM, struct drm_i915_gem_context_param)
 #define DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_SETPARAM, struct drm_i915_gem_context_param)
+#define DRM_IOCTL_I915_PERF_OPEN	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_PERF_OPEN, struct drm_i915_perf_open_param)
 
 /* Allow drivers to submit batchbuffers directly to hardware, relying
  * on the security mechanisms provided by hardware.
@@ -1170,4 +1172,69 @@ struct drm_i915_gem_context_param {
 	__u64 value;
 };
 
+enum drm_i915_perf_property_id {
+	/**
+	 * Open the stream for a specific context handle (as used with
+	 * execbuffer2). A stream opened for a specific context this way
+	 * won't typically require root privileges.
+	 */
+	DRM_I915_PERF_PROP_CTX_HANDLE = 1,
+
+	DRM_I915_PERF_PROP_MAX /* non-ABI */
+};
+
+struct drm_i915_perf_open_param {
+	__u32 flags;
+#define I915_PERF_FLAG_FD_CLOEXEC	(1<<0)
+#define I915_PERF_FLAG_FD_NONBLOCK	(1<<1)
+#define I915_PERF_FLAG_DISABLED		(1<<2)
+
+	/** The number of u64 (id, value) pairs */
+	__u32 num_properties;
+
+	/**
+	 * Pointer to array of u64 (id, value) pairs configuring the stream
+	 * to open.
+	 */
+	__u64 __user properties_ptr;
+};
+
+#define I915_PERF_IOCTL_ENABLE	_IO('i', 0x0)
+#define I915_PERF_IOCTL_DISABLE	_IO('i', 0x1)
+
+/**
+ * Common to all i915 perf records
+ */
+struct drm_i915_perf_record_header {
+	__u32 type;
+	__u16 pad;
+	__u16 size;
+};
+
+enum drm_i915_perf_record_type {
+
+	/**
+	 * Samples are the workhorse record type whose contents are extensible
+	 * and defined when opening an i915 perf stream based on the given
+	 * properties.
+	 *
+	 * Boolean properties following the naming convention
+	 * DRM_I915_PERF_PROP_SAMPLE_xyz request the inclusion of 'xyz' data in
+	 * every sample.
+	 *
+	 * The order of these sample properties given by userspace has no
+	 * effect on the ordering of data within a sample. The order will be
+	 * documented here.
+	 *
+	 * struct {
+	 *     struct drm_i915_perf_record_header header;
+	 *
+	 *     TODO: itemize extensible sample data here
+	 * };
+	 */
+	DRM_I915_PERF_RECORD_SAMPLE = 1,
+
+	DRM_I915_PERF_RECORD_MAX /* non-ABI */
+};
+
 #endif /* _UAPI_I915_DRM_H_ */
-- 
2.7.1

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [PATCH 2/9] drm/i915: rename OACONTROL GEN7_OACONTROL
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
  2016-04-20 14:23 ` [PATCH 1/9] drm/i915: Add i915 perf infrastructure Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 3/9] drm/i915: don't whitelist oacontrol in cmd parser Robert Bragg
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Sourab Gupta, Deepak S, Daniel Vetter, Robert Bragg

OACONTROL changes quite a bit for gen8, with some bits split out into a
per-context OACTXCONTROL register. Rename it now, before adding more gen7 OA
registers.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_cmd_parser.c | 4 ++--
 drivers/gpu/drm/i915/i915_reg.h        | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
index a337f33..035f2dd 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -445,7 +445,7 @@ static const struct drm_i915_reg_descriptor gen7_render_regs[] = {
 	REG64(PS_INVOCATION_COUNT),
 	REG64(PS_DEPTH_COUNT),
 	REG64_IDX(RING_TIMESTAMP, RENDER_RING_BASE),
-	REG32(OACONTROL), /* Only allowed for LRI and SRM. See below. */
+	REG32(GEN7_OACONTROL), /* Only allowed for LRI and SRM. See below. */
 	REG64(MI_PREDICATE_SRC0),
 	REG64(MI_PREDICATE_SRC1),
 	REG32(GEN7_3DPRIM_END_OFFSET),
@@ -1092,7 +1092,7 @@ static bool check_cmd(const struct intel_engine_cs *engine,
 			 * to the register. Hence, limit OACONTROL writes to
 			 * only MI_LOAD_REGISTER_IMM commands.
 			 */
-			if (reg_addr == i915_mmio_reg_offset(OACONTROL)) {
+			if (reg_addr == i915_mmio_reg_offset(GEN7_OACONTROL)) {
 				if (desc->cmd.value == MI_LOAD_REGISTER_MEM) {
 					DRM_DEBUG_DRIVER("CMD: Rejected LRM to OACONTROL\n");
 					return false;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9464ba3..de1e9a0 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -593,7 +593,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define HSW_CS_GPR(n)                   _MMIO(0x2600 + (n) * 8)
 #define HSW_CS_GPR_UDW(n)               _MMIO(0x2600 + (n) * 8 + 4)
 
-#define OACONTROL _MMIO(0x2360)
+#define GEN7_OACONTROL _MMIO(0x2360)
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
-- 
2.7.1



* [PATCH 3/9] drm/i915: don't whitelist oacontrol in cmd parser
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
  2016-04-20 14:23 ` [PATCH 1/9] drm/i915: Add i915 perf infrastructure Robert Bragg
  2016-04-20 14:23 ` [PATCH 2/9] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 4/9] drm/i915: Add 'render basic' Haswell OA unit config Robert Bragg
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Sourab Gupta, Deepak S, Daniel Vetter, Robert Bragg

Being able to program OACONTROL from a non-privileged batch buffer is
not sufficient to be able to configure the OA unit. This was originally
allowed to help enable Mesa to expose OA counters via the
INTEL_performance_query extension, but the current implementation based
on programming OACONTROL via a batch buffer isn't able to report usable
data without a more complete OA unit configuration. Mesa handles the
possibility that writes to OACONTROL may not be allowed and so only
advertises the extension after explicitly testing that a write to
OACONTROL succeeds. Based on this, removing OACONTROL from the whitelist
should be ok for userspace.

Removing this simplifies adding a new kernel API for configuring the OA
unit without needing to consider the possibility that userspace might
trample on OACONTROL state which we'd like to start managing within
the kernel instead. In particular, running any Mesa based GL application
currently results in clearing OACONTROL when initializing, which would
disable the capturing of metrics.
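
For readers unfamiliar with the whitelist, the effect of dropping an entry can
be sketched as a simple table lookup. This is a simplified illustration and
not the cmd parser's actual data structures: the sketch_* names are
hypothetical, the 0x2360 offset is the one shown for GEN7_OACONTROL in the
rename patch of this series, and the other table entries are placeholder
offsets rather than real registers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GEN7_OACONTROL 0x2360u /* offset from the rename patch */

/* Placeholder offsets standing in for the other whitelisted registers. */
static const uint32_t sketch_regs_before[] = {
	0x1000u, SKETCH_GEN7_OACONTROL, 0x3000u,
};
static const uint32_t sketch_regs_after[] = {
	0x1000u, 0x3000u,
};

/* A register write from a batch is accepted only if its offset is listed. */
static bool sketch_reg_allowed(const uint32_t *whitelist, size_t n,
			       uint32_t reg)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (whitelist[i] == reg)
			return true;
	}

	return false;
}
```

With the entry gone, a batch that writes OACONTROL is rejected like a write to
any other unlisted register, so no OACONTROL-specific handling is needed.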

XXX: actually, since rebasing this on a recent nightly, this patch does
seem to be breaking gnome-shell on mesa-11.1. It's not clear at the moment
why the change didn't seem to cause a problem based on v4.5 with the
same gnome-shell/mesa versions.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_cmd_parser.c | 33 ++-------------------------------
 1 file changed, 2 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 035f2dd..8d323b9 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -445,7 +445,6 @@ static const struct drm_i915_reg_descriptor gen7_render_regs[] = {
 	REG64(PS_INVOCATION_COUNT),
 	REG64(PS_DEPTH_COUNT),
 	REG64_IDX(RING_TIMESTAMP, RENDER_RING_BASE),
-	REG32(GEN7_OACONTROL), /* Only allowed for LRI and SRM. See below. */
 	REG64(MI_PREDICATE_SRC0),
 	REG64(MI_PREDICATE_SRC1),
 	REG32(GEN7_3DPRIM_END_OFFSET),
@@ -1044,8 +1043,7 @@ bool i915_needs_cmd_parser(struct intel_engine_cs *engine)
 static bool check_cmd(const struct intel_engine_cs *engine,
 		      const struct drm_i915_cmd_descriptor *desc,
 		      const u32 *cmd, u32 length,
-		      const bool is_master,
-		      bool *oacontrol_set)
+		      const bool is_master)
 {
 	if (desc->flags & CMD_DESC_REJECT) {
 		DRM_DEBUG_DRIVER("CMD: Rejected command: 0x%08X\n", *cmd);
@@ -1083,26 +1081,6 @@ static bool check_cmd(const struct intel_engine_cs *engine,
 			}
 
 			/*
-			 * OACONTROL requires some special handling for
-			 * writes. We want to make sure that any batch which
-			 * enables OA also disables it before the end of the
-			 * batch. The goal is to prevent one process from
-			 * snooping on the perf data from another process. To do
-			 * that, we need to check the value that will be written
-			 * to the register. Hence, limit OACONTROL writes to
-			 * only MI_LOAD_REGISTER_IMM commands.
-			 */
-			if (reg_addr == i915_mmio_reg_offset(GEN7_OACONTROL)) {
-				if (desc->cmd.value == MI_LOAD_REGISTER_MEM) {
-					DRM_DEBUG_DRIVER("CMD: Rejected LRM to OACONTROL\n");
-					return false;
-				}
-
-				if (desc->cmd.value == MI_LOAD_REGISTER_IMM(1))
-					*oacontrol_set = (cmd[offset + 1] != 0);
-			}
-
-			/*
 			 * Check the value written to the register against the
 			 * allowed mask/value pair given in the whitelist entry.
 			 */
@@ -1186,7 +1164,6 @@ int i915_parse_cmds(struct intel_engine_cs *engine,
 {
 	u32 *cmd, *batch_base, *batch_end;
 	struct drm_i915_cmd_descriptor default_desc = { 0 };
-	bool oacontrol_set = false; /* OACONTROL tracking. See check_cmd() */
 	int ret = 0;
 
 	batch_base = copy_batch(shadow_batch_obj, batch_obj,
@@ -1243,8 +1220,7 @@ int i915_parse_cmds(struct intel_engine_cs *engine,
 			break;
 		}
 
-		if (!check_cmd(engine, desc, cmd, length, is_master,
-			       &oacontrol_set)) {
+		if (!check_cmd(engine, desc, cmd, length, is_master)) {
 			ret = -EINVAL;
 			break;
 		}
@@ -1252,11 +1228,6 @@ int i915_parse_cmds(struct intel_engine_cs *engine,
 		cmd += length;
 	}
 
-	if (oacontrol_set) {
-		DRM_DEBUG_DRIVER("CMD: batch set OACONTROL but did not clear it\n");
-		ret = -EINVAL;
-	}
-
 	if (cmd >= batch_end) {
 		DRM_DEBUG_DRIVER("CMD: Got to the end of the buffer w/o a BBE cmd!\n");
 		ret = -EINVAL;
-- 
2.7.1

* [PATCH 4/9] drm/i915: Add 'render basic' Haswell OA unit config
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (2 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 3/9] drm/i915: don't whitelist oacontrol in cmd parser Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

Adds a static OA unit, MUX + B Counter configuration for basic render
metrics on Haswell. This is autogenerated from an internal XML
description of metric sets.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
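
For readers unfamiliar with the addr/value table layout, here is a minimal
stand-alone sketch of how such a table might be walked when a metric set
is applied. The sketch_* names and the recording "fake MMIO" writer are
hypothetical stand-ins (the real driver would write to hardware), and the
four entries are copied from b_counter_config_render_basic[] in this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Same shape as struct i915_oa_reg in this patch. */
struct sketch_oa_reg {
	uint32_t addr;
	uint32_t value;
};

/* Values copied from b_counter_config_render_basic[] in this patch. */
static const struct sketch_oa_reg sketch_b_counter_regs[] = {
	{ 0x2724, 0x00800000 },
	{ 0x2720, 0x00000000 },
	{ 0x2714, 0x00800000 },
	{ 0x2710, 0x00000000 },
};

/* Hypothetical stand-in for the register file; it only records writes. */
struct sketch_fake_mmio {
	size_t n_writes;
	uint32_t last_addr;
	uint32_t last_value;
};

static void sketch_fake_write(struct sketch_fake_mmio *mmio,
			      uint32_t addr, uint32_t value)
{
	mmio->n_writes++;
	mmio->last_addr = addr;
	mmio->last_value = value;
}

/* Writes every addr/value pair in table order; returns the number of
 * registers written.
 */
static size_t sketch_apply_regs(struct sketch_fake_mmio *mmio,
				const struct sketch_oa_reg *regs,
				size_t n_regs)
{
	size_t i;

	assert(regs != NULL || n_regs == 0);

	for (i = 0; i < n_regs; i++)
		sketch_fake_write(mmio, regs[i].addr, regs[i].value);

	return n_regs;
}
```

Keeping the configuration as plain data like this is what allows the file
to be autogenerated: adding a metric set means emitting another table and
another case in the selection switch, with no new logic.
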
 drivers/gpu/drm/i915/Makefile      |   3 +-
 drivers/gpu/drm/i915/i915_drv.h    |  14 ++++
 drivers/gpu/drm/i915/i915_oa_hsw.c | 132 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_oa_hsw.h |  34 ++++++++++
 4 files changed, 182 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.c
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2f7ef71..2a3dc67 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -99,7 +99,8 @@ i915-y += dvo_ch7017.o \
 i915-y += i915_vgpu.o
 
 # perf code
-i915-y += i915_perf.o
+i915-y += i915_perf.o \
+	  i915_oa_hsw.o
 
 # legacy horrors
 i915-y += i915_dma.o
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5a2a4d6..5e959f3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1708,6 +1708,11 @@ struct intel_wm_config {
 	bool sprites_scaled;
 };
 
+struct i915_oa_reg {
+	u32 addr;
+	u32 value;
+};
+
 struct i915_perf_read_state {
 	int count;
 	ssize_t read;
@@ -2056,6 +2061,15 @@ struct drm_i915_private {
 		bool initialized;
 		struct mutex lock;
 		struct list_head streams;
+
+		struct {
+			u32 metrics_set;
+
+			const struct i915_oa_reg *mux_regs;
+			int mux_regs_len;
+			const struct i915_oa_reg *b_counter_regs;
+			int b_counter_regs_len;
+		} oa;
 	} perf;
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
diff --git a/drivers/gpu/drm/i915/i915_oa_hsw.c b/drivers/gpu/drm/i915/i915_oa_hsw.c
new file mode 100644
index 0000000..5472aa0
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_oa_hsw.c
@@ -0,0 +1,132 @@
+/*
+ * Autogenerated file, DO NOT EDIT manually!
+ *
+ * Copyright (c) 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "i915_drv.h"
+
+enum metric_set_id {
+	METRIC_SET_ID_RENDER_BASIC = 1,
+};
+
+int i915_oa_n_builtin_metric_sets_hsw = 1;
+
+static const struct i915_oa_reg b_counter_config_render_basic[] = {
+	{ _MMIO(0x2724), 0x00800000 },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2714), 0x00800000 },
+	{ _MMIO(0x2710), 0x00000000 },
+};
+
+static const struct i915_oa_reg mux_config_render_basic[] = {
+	{ _MMIO(0x253A4), 0x01600000 },
+	{ _MMIO(0x25440), 0x00100000 },
+	{ _MMIO(0x25128), 0x00000000 },
+	{ _MMIO(0x2691C), 0x00000800 },
+	{ _MMIO(0x26AA0), 0x01500000 },
+	{ _MMIO(0x26B9C), 0x00006000 },
+	{ _MMIO(0x2791C), 0x00000800 },
+	{ _MMIO(0x27AA0), 0x01500000 },
+	{ _MMIO(0x27B9C), 0x00006000 },
+	{ _MMIO(0x2641C), 0x00000400 },
+	{ _MMIO(0x25380), 0x00000010 },
+	{ _MMIO(0x2538C), 0x00000000 },
+	{ _MMIO(0x25384), 0x0800AAAA },
+	{ _MMIO(0x25400), 0x00000004 },
+	{ _MMIO(0x2540C), 0x06029000 },
+	{ _MMIO(0x25410), 0x00000002 },
+	{ _MMIO(0x25404), 0x5C30FFFF },
+	{ _MMIO(0x25100), 0x00000016 },
+	{ _MMIO(0x25110), 0x00000400 },
+	{ _MMIO(0x25104), 0x00000000 },
+	{ _MMIO(0x26804), 0x00001211 },
+	{ _MMIO(0x26884), 0x00000100 },
+	{ _MMIO(0x26900), 0x00000002 },
+	{ _MMIO(0x26908), 0x00700000 },
+	{ _MMIO(0x26904), 0x00000000 },
+	{ _MMIO(0x26984), 0x00001022 },
+	{ _MMIO(0x26A04), 0x00000011 },
+	{ _MMIO(0x26A80), 0x00000006 },
+	{ _MMIO(0x26A88), 0x00000C02 },
+	{ _MMIO(0x26A84), 0x00000000 },
+	{ _MMIO(0x26B04), 0x00001000 },
+	{ _MMIO(0x26B80), 0x00000002 },
+	{ _MMIO(0x26B8C), 0x00000007 },
+	{ _MMIO(0x26B84), 0x00000000 },
+	{ _MMIO(0x27804), 0x00004844 },
+	{ _MMIO(0x27884), 0x00000400 },
+	{ _MMIO(0x27900), 0x00000002 },
+	{ _MMIO(0x27908), 0x0E000000 },
+	{ _MMIO(0x27904), 0x00000000 },
+	{ _MMIO(0x27984), 0x00004088 },
+	{ _MMIO(0x27A04), 0x00000044 },
+	{ _MMIO(0x27A80), 0x00000006 },
+	{ _MMIO(0x27A88), 0x00018040 },
+	{ _MMIO(0x27A84), 0x00000000 },
+	{ _MMIO(0x27B04), 0x00004000 },
+	{ _MMIO(0x27B80), 0x00000002 },
+	{ _MMIO(0x27B8C), 0x000000E0 },
+	{ _MMIO(0x27B84), 0x00000000 },
+	{ _MMIO(0x26104), 0x00002222 },
+	{ _MMIO(0x26184), 0x0C006666 },
+	{ _MMIO(0x26284), 0x04000000 },
+	{ _MMIO(0x26304), 0x04000000 },
+	{ _MMIO(0x26400), 0x00000002 },
+	{ _MMIO(0x26410), 0x000000A0 },
+	{ _MMIO(0x26404), 0x00000000 },
+	{ _MMIO(0x25420), 0x04108020 },
+	{ _MMIO(0x25424), 0x1284A420 },
+	{ _MMIO(0x2541C), 0x00000000 },
+	{ _MMIO(0x25428), 0x00042049 },
+};
+
+static int select_render_basic_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_render_basic;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_render_basic);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_render_basic;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_render_basic);
+
+	return 0;
+}
+
+int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs = NULL;
+	dev_priv->perf.oa.mux_regs_len = 0;
+	dev_priv->perf.oa.b_counter_regs = NULL;
+	dev_priv->perf.oa.b_counter_regs_len = 0;
+
+	switch (dev_priv->perf.oa.metrics_set) {
+	case METRIC_SET_ID_RENDER_BASIC:
+		return select_render_basic_config(dev_priv);
+	default:
+		return -ENODEV;
+	}
+}
diff --git a/drivers/gpu/drm/i915/i915_oa_hsw.h b/drivers/gpu/drm/i915/i915_oa_hsw.h
new file mode 100644
index 0000000..b618a1f
--- /dev/null
+++ b/drivers/gpu/drm/i915/i915_oa_hsw.h
@@ -0,0 +1,34 @@
+/*
+ * Autogenerated file, DO NOT EDIT manually!
+ *
+ * Copyright (c) 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __I915_OA_HSW_H__
+#define __I915_OA_HSW_H__
+
+extern int i915_oa_n_builtin_metric_sets_hsw;
+
+extern int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv);
+
+#endif
-- 
2.7.1

* [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (3 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 4/9] drm/i915: Add 'render basic' Haswell OA unit config Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 16:16   ` kbuild test robot
                     ` (8 more replies)
  2016-04-20 14:23 ` [PATCH 6/9] drm/i915: advertise available metrics via sysfs Robert Bragg
                   ` (8 subsequent siblings)
  13 siblings, 9 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

Gen graphics hardware can be set up to periodically write snapshots of
performance counters into a circular buffer via its Observation
Architecture, and this patch exposes that capability to userspace via
the i915 perf interface.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
---
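
To make the head/tail arithmetic easier to follow, here is a minimal
stand-alone sketch of the OA_TAKEN() calculation and the "is empty" test
with a tail margin, mirroring the logic in
gen7_oa_buffer_is_empty_fop_unlocked() below. The sketch_* names are
hypothetical and the offsets are plain buffer-relative byte offsets rather
than GGTT addresses read from the status registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Must be a power of two so that masking implements the wrap-around. */
#define SKETCH_OA_BUFFER_SIZE	(16u * 1024u * 1024u)

/* Number of bytes between head and tail, allowing for the tail having
 * wrapped past the end of the buffer.
 */
static uint32_t sketch_oa_taken(uint32_t tail, uint32_t head)
{
	assert((SKETCH_OA_BUFFER_SIZE & (SKETCH_OA_BUFFER_SIZE - 1)) == 0);

	return (tail - head) & (SKETCH_OA_BUFFER_SIZE - 1);
}

/* The buffer is treated as empty until at least one whole report sits
 * behind the margin that is held back from the hardware tail.
 */
static bool sketch_oa_buffer_is_empty(uint32_t head, uint32_t tail,
				      uint32_t report_size,
				      uint32_t tail_margin)
{
	return sketch_oa_taken(tail, head) < tail_margin + report_size;
}
```

For example, with 64 byte reports and a 256 byte margin, a tail that is
256 bytes ahead of the head still counts as empty, while a tail 320 bytes
ahead exposes exactly one report to read().
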
 drivers/gpu/drm/i915/i915_drv.h         |  56 +-
 drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
 drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
 include/uapi/drm/i915_drm.h             |  70 ++-
 5 files changed, 1408 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5e959f3..972ae6c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1708,8 +1708,13 @@ struct intel_wm_config {
 	bool sprites_scaled;
 };
 
+struct i915_oa_format {
+	u32 format;
+	int size;
+};
+
 struct i915_oa_reg {
-	u32 addr;
+	i915_reg_t addr;
 	u32 value;
 };
 
@@ -1725,6 +1730,7 @@ struct i915_perf_stream {
 	struct list_head link;
 
 	u32 sample_flags;
+	int sample_size;
 
 	struct intel_context *ctx;
 	bool enabled;
@@ -1786,6 +1792,20 @@ struct i915_perf_stream {
 	void (*destroy)(struct i915_perf_stream *stream);
 };
 
+struct i915_oa_ops {
+	void (*init_oa_buffer)(struct drm_i915_private *dev_priv);
+	int (*enable_metric_set)(struct drm_i915_private *dev_priv);
+	void (*disable_metric_set)(struct drm_i915_private *dev_priv);
+	void (*oa_enable)(struct drm_i915_private *dev_priv);
+	void (*oa_disable)(struct drm_i915_private *dev_priv);
+	void (*update_oacontrol)(struct drm_i915_private *dev_priv);
+	void (*update_hw_ctx_id_locked)(struct drm_i915_private *dev_priv,
+					u32 ctx_id);
+	int (*read)(struct i915_perf_stream *stream,
+		    struct i915_perf_read_state *read_state);
+	bool (*oa_buffer_is_empty)(struct drm_i915_private *dev_priv);
+};
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
@@ -2059,16 +2079,46 @@ struct drm_i915_private {
 
 	struct {
 		bool initialized;
+
 		struct mutex lock;
 		struct list_head streams;
 
+		spinlock_t hook_lock;
+
 		struct {
-			u32 metrics_set;
+			struct i915_perf_stream *exclusive_stream;
+
+			u32 specific_ctx_id;
+
+			struct hrtimer poll_check_timer;
+			wait_queue_head_t poll_wq;
+
+			bool periodic;
+			int period_exponent;
+			int timestamp_frequency;
+
+			int tail_margin;
+
+			int metrics_set;
 
 			const struct i915_oa_reg *mux_regs;
 			int mux_regs_len;
 			const struct i915_oa_reg *b_counter_regs;
 			int b_counter_regs_len;
+
+			struct {
+				struct drm_i915_gem_object *obj;
+				u32 gtt_offset;
+				u8 *addr;
+				int format;
+				int format_size;
+			} oa_buffer;
+
+			u32 gen7_latched_oastatus1;
+
+			struct i915_oa_ops ops;
+			const struct i915_oa_format *oa_formats;
+			int n_builtin_sets;
 		} oa;
 	} perf;
 
@@ -3429,6 +3479,8 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context);
 
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e5acc39..ed5665f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -133,6 +133,23 @@ static int get_context_size(struct drm_device *dev)
 	return ret;
 }
 
+static int i915_gem_context_pin_state(struct drm_device *dev,
+				      struct intel_context *ctx)
+{
+	int ret;
+
+	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
+
+	ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
+				    get_context_alignment(dev), 0);
+	if (ret)
+		return ret;
+
+	i915_oa_context_pin_notify(dev->dev_private, ctx);
+
+	return 0;
+}
+
 static void i915_gem_context_clean(struct intel_context *ctx)
 {
 	struct i915_hw_ppgtt *ppgtt = ctx->ppgtt;
@@ -287,8 +304,7 @@ i915_gem_create_context(struct drm_device *dev,
 		 * be available. To avoid this we always pin the default
 		 * context.
 		 */
-		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(dev), 0);
+		ret = i915_gem_context_pin_state(dev, ctx);
 		if (ret) {
 			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
 			goto err_destroy;
@@ -671,9 +687,7 @@ static int do_rcs_switch(struct drm_i915_gem_request *req)
 		return 0;
 
 	/* Trying to pin first makes error handling easier. */
-	ret = i915_gem_obj_ggtt_pin(to->legacy_hw_ctx.rcs_state,
-				    get_context_alignment(engine->dev),
-				    0);
+	ret = i915_gem_context_pin_state(engine->dev, to);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 2143401..5e58520 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -25,14 +25,830 @@
 #include <linux/sizes.h>
 
 #include "i915_drv.h"
+#include "intel_ringbuffer.h"
+#include "intel_lrc.h"
+#include "i915_oa_hsw.h"
+
+/* Must be a power of two */
+#define OA_BUFFER_SIZE		SZ_16M
+#define OA_TAKEN(tail, head)	((tail - head) & (OA_BUFFER_SIZE - 1))
+
+/* There's a HW race condition between OA unit tail pointer register updates and
+ * writes to memory whereby the tail pointer can sometimes get ahead of what's
+ * been written out to the OA buffer so far.
+ *
+ * Although this can be observed explicitly by checking for a zeroed report-id
+ * field in tail reports, it seems preferable to account for this earlier e.g.
+ * as part of the _oa_buffer_is_empty checks to minimize -EAGAIN polling cycles
+ * in this situation.
+ *
+ * To give time for the most recent reports to land before they may be copied to
+ * userspace, the driver operates as if the tail pointer effectively lags behind
+ * the HW tail pointer by 'tail_margin' bytes. The margin in bytes is calculated
+ * based on this constant in nanoseconds, the current OA sampling exponent
+ * and current report size.
+ *
+ * There is also a fallback check while reading to simply skip over reports with
+ * a zeroed report-id.
+ */
+#define OA_TAIL_MARGIN_NSEC	100000ULL
+
+/* frequency for checking whether the OA unit has written new reports to the
+ * circular OA buffer...
+ */
+#define POLL_FREQUENCY 200
+#define POLL_PERIOD (NSEC_PER_SEC / POLL_FREQUENCY)
+
+/* The maximum exponent the hardware accepts is 63 (essentially it selects one
+ * of the 64bit timestamp bits to trigger reports from) but there's currently
+ * no known use case for sampling as infrequently as once per 47 thousand years.
+ *
+ * Since the timestamps included in OA reports are only 32bits it seems
+ * reasonable to limit the OA exponent where it's still possible to account for
+ * overflow in OA report timestamps.
+ */
+#define OA_EXPONENT_MAX 31
+
+/* XXX: beware if future OA HW adds new report formats that the current
+ * code assumes all reports have a power-of-two size and ~(size - 1) can
+ * be used as a mask to align the OA tail pointer.
+ */
+static struct i915_oa_format hsw_oa_formats[I915_OA_FORMAT_MAX] = {
+	[I915_OA_FORMAT_A13]	    = { 0, 64 },
+	[I915_OA_FORMAT_A29]	    = { 1, 128 },
+	[I915_OA_FORMAT_A13_B8_C8]  = { 2, 128 },
+	/* A29_B8_C8 Disallowed as 192 bytes doesn't factor into buffer size */
+	[I915_OA_FORMAT_B4_C8]	    = { 4, 64 },
+	[I915_OA_FORMAT_A45_B8_C8]  = { 5, 256 },
+	[I915_OA_FORMAT_B4_C8_A16]  = { 6, 128 },
+	[I915_OA_FORMAT_C4_B8]	    = { 7, 64 },
+};
+
+#define SAMPLE_OA_REPORT      (1<<0)
 
 struct perf_open_properties {
 	u32 sample_flags;
 
 	u64 single_context:1;
 	u64 ctx_handle;
+
+	/* OA sampling state */
+	int metrics_set;
+	int oa_format;
+	bool oa_periodic;
+	int oa_period_exponent;
 };
 
+/* NB: This is either called via fops or the poll check hrtimer (atomic ctx)
+ *
+ * It's safe to read OA config state here unlocked, assuming that this is only
+ * called while the stream is enabled, while the global OA configuration can't
+ * be modified.
+ *
+ * Note: we don't lock around the head/tail reads even though there's the slim
+ * possibility of read() fop errors forcing a re-init of the OA buffer
+ * pointers.  A race here could result in a false positive !empty status which
+ * is acceptable.
+ */
+static bool gen7_oa_buffer_is_empty_fop_unlocked(struct drm_i915_private *dev_priv)
+{
+	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u32 oastatus2 = I915_READ(GEN7_OASTATUS2);
+	u32 oastatus1 = I915_READ(GEN7_OASTATUS1);
+	u32 head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
+	u32 tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+
+	return OA_TAKEN(tail, head) <
+		dev_priv->perf.oa.tail_margin + report_size;
+}
+
+/**
+ * Appends a status record to a userspace read() buffer.
+ */
+static int append_oa_status(struct i915_perf_stream *stream,
+			    struct i915_perf_read_state *read_state,
+			    enum drm_i915_perf_record_type type)
+{
+	struct drm_i915_perf_record_header header = { type, 0, sizeof(header) };
+
+	if ((read_state->count - read_state->read) < header.size)
+		return -ENOSPC;
+
+	if (copy_to_user(read_state->buf, &header, sizeof(header)))
+		return -EFAULT;
+
+	read_state->buf += header.size;
+	read_state->read += header.size;
+
+	return 0;
+}
+
+/**
+ * Copies single OA report into userspace read() buffer.
+ */
+static int append_oa_sample(struct i915_perf_stream *stream,
+			    struct i915_perf_read_state *read_state,
+			    const u8 *report)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
+	struct drm_i915_perf_record_header header;
+	u32 sample_flags = stream->sample_flags;
+	char __user *buf = read_state->buf;
+
+	header.type = DRM_I915_PERF_RECORD_SAMPLE;
+	header.pad = 0;
+	header.size = stream->sample_size;
+
+	if ((read_state->count - read_state->read) < header.size)
+		return -ENOSPC;
+
+	if (copy_to_user(buf, &header, sizeof(header)))
+		return -EFAULT;
+	buf += sizeof(header);
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		if (copy_to_user(buf, report, report_size))
+			return -EFAULT;
+	}
+
+	read_state->buf += header.size;
+	read_state->read += header.size;
+
+	return 0;
+}
+
+/**
+ * Copies all buffered OA reports into userspace read() buffer.
+ * @head_ptr: (inout): the head pointer before and after appending
+ *
+ * Returns 0 on success, negative error code on failure.
+ *
+ * Notably any error condition resulting in a short read (-ENOSPC or
+ * -EFAULT) will be returned even though one or more records may
+ * have been successfully copied. In this case the error may be
+ * squashed before returning to userspace.
+ */
+static int gen7_append_oa_reports(struct i915_perf_stream *stream,
+				  struct i915_perf_read_state *read_state,
+				  u32 *head_ptr,
+				  u32 tail)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.addr;
+	int tail_margin = dev_priv->perf.oa.tail_margin;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	u32 head;
+	u32 taken;
+	int ret = 0;
+
+	BUG_ON(!stream->enabled);
+
+	head = *head_ptr - dev_priv->perf.oa.oa_buffer.gtt_offset;
+	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
+
+	/* The OA unit is expected to wrap the tail pointer according to the OA
+	 * buffer size and since we should never write a misaligned head
+	 * pointer we don't expect to read one back either...
+	 */
+	if (tail > OA_BUFFER_SIZE || head > OA_BUFFER_SIZE ||
+	    head % report_size) {
+		DRM_ERROR("Inconsistent OA buffer pointer (head = %u, tail = %u): force restart\n",
+			  head, tail);
+		dev_priv->perf.oa.ops.oa_disable(dev_priv);
+		dev_priv->perf.oa.ops.oa_enable(dev_priv);
+		*head_ptr = I915_READ(GEN7_OASTATUS2) &
+			GEN7_OASTATUS2_HEAD_MASK;
+		return -EIO;
+	}
+
+
+	/* The tail pointer increases in 64 byte increments, not in report_size
+	 * steps...
+	 */
+	tail &= ~(report_size - 1);
+
+	/* Move the tail pointer back by the current tail_margin to account for
+	 * the possibility that the latest reports may not have really landed
+	 * in memory yet...
+	 */
+
+	if (OA_TAKEN(tail, head) < report_size + tail_margin)
+		return -EAGAIN;
+
+	tail -= tail_margin;
+	tail &= mask;
+
+	for (/* none */;
+	     (taken = OA_TAKEN(tail, head));
+	     head = (head + report_size) & mask) {
+		u8 *report = oa_buf_base + head;
+		u32 *report32 = (void *)report;
+
+		/* All the report sizes factor neatly into the buffer
+		 * size so we never expect to see a report split
+		 * between the beginning and end of the buffer.
+		 *
+		 * Given the initial alignment check a misalignment
+		 * here would imply a driver bug that would result
+		 * in an overrun.
+		 */
+		BUG_ON((OA_BUFFER_SIZE - head) < report_size);
+
+		/* The report-ID field for periodic samples includes
+		 * some undocumented flags related to what triggered
+		 * the report and is never expected to be zero so we
+		 * can check that the report isn't invalid before
+		 * copying it to userspace...
+		 */
+		if (report32[0] == 0) {
+			DRM_ERROR("Skipping spurious, invalid OA report\n");
+			continue;
+		}
+
+		ret = append_oa_sample(stream, read_state, report);
+		if (ret)
+			break;
+
+		/* The above report-id field sanity check is based on
+		 * the assumption that the OA buffer is initially
+		 * zeroed and we reset the field after copying so the
+		 * check is still meaningful once old reports start
+		 * being overwritten.
+		 */
+		report32[0] = 0;
+	}
+
+	*head_ptr = dev_priv->perf.oa.oa_buffer.gtt_offset + head;
+
+	return ret;
+}
+
+static int gen7_oa_read(struct i915_perf_stream *stream,
+			struct i915_perf_read_state *read_state)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u32 oastatus2;
+	u32 oastatus1;
+	u32 head;
+	u32 tail;
+	int ret;
+
+	WARN_ON(!dev_priv->perf.oa.oa_buffer.addr);
+
+	oastatus2 = I915_READ(GEN7_OASTATUS2);
+	oastatus1 = I915_READ(GEN7_OASTATUS1);
+
+	head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
+	tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+
+	/* XXX: On Haswell we don't have a safe way to clear oastatus1
+	 * bits while the OA unit is enabled (while the tail pointer
+	 * may be updated asynchronously) so we ignore status bits
+	 * that have already been reported to userspace.
+	 */
+	oastatus1 &= ~dev_priv->perf.oa.gen7_latched_oastatus1;
+
+	/* We treat OABUFFER_OVERFLOW as a significant error:
+	 *
+	 * - The status can be interpreted to mean that the buffer is
+	 *   currently full (with a higher precedence than OA_TAKEN()
+	 *   which will start to report a near-empty buffer after an
+	 *   overflow) but it's awkward that we can't clear the status
+	 *   on Haswell, so without a reset we won't be able to catch
+	 *   the state again.
+	 *
+	 * - Since it also implies the HW has started overwriting old
+	 *   reports it may also affect our sanity checks for invalid
+	 *   reports when copying to userspace that assume new reports
+	 *   are being written to cleared memory.
+	 *
+	 * - In the future we may want to introduce a flight recorder
+	 *   mode where the driver will automatically maintain a safe
+	 *   guard band between head/tail, avoiding this overflow
+	 *   condition, but we avoid the added driver complexity for
+	 *   now.
+	 */
+	if (unlikely(oastatus1 & GEN7_OASTATUS1_OABUFFER_OVERFLOW)) {
+		ret = append_oa_status(stream, read_state,
+				       DRM_I915_PERF_RECORD_OA_BUFFER_LOST);
+		if (ret)
+			return ret;
+
+		DRM_ERROR("OA buffer overflow: force restart\n");
+
+		dev_priv->perf.oa.ops.oa_disable(dev_priv);
+		dev_priv->perf.oa.ops.oa_enable(dev_priv);
+
+		oastatus2 = I915_READ(GEN7_OASTATUS2);
+		oastatus1 = I915_READ(GEN7_OASTATUS1);
+
+		head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
+		tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+	}
+
+	if (unlikely(oastatus1 & GEN7_OASTATUS1_REPORT_LOST)) {
+		ret = append_oa_status(stream, read_state,
+				       DRM_I915_PERF_RECORD_OA_REPORT_LOST);
+		if (ret)
+			return ret;
+		dev_priv->perf.oa.gen7_latched_oastatus1 |=
+			GEN7_OASTATUS1_REPORT_LOST;
+	}
+
+	ret = gen7_append_oa_reports(stream, read_state, &head, tail);
+
+	/* All the report sizes are a power of two and the
+	 * head should always be incremented by some multiple
+	 * of the report size.
+	 *
+	 * A warning here, but notably if we later read back a
+	 * misaligned pointer we will treat that as a bug since
+	 * it could lead to a buffer overrun.
+	 */
+	WARN_ONCE(head & (report_size - 1),
+		  "i915: Writing misaligned OA head pointer\n");
+
+	/* Note: we update the head pointer here even if an error
+	 * was returned since the error may represent a short read
+	 * where some reports were successfully copied.
+	 */
+	I915_WRITE(GEN7_OASTATUS2,
+		   ((head & GEN7_OASTATUS2_HEAD_MASK) |
+		    OA_MEM_SELECT_GGTT));
+
+	return ret;
+}
+
+static bool i915_oa_can_read(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
+}
+
+static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	/* Note: the oa_buffer_is_empty() condition is ok to run unlocked as it
+	 * just performs mmio reads of the OA buffer head + tail pointers and
+	 * it's assumed we're handling some operation that implies the stream
+	 * can't be destroyed until completion (such as a read()) that ensures
+	 * the device + OA buffer can't disappear
+	 */
+	return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
+					!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv));
+}
+
+static void i915_oa_poll_wait(struct i915_perf_stream *stream,
+			      struct file *file,
+			      poll_table *wait)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	poll_wait(file, &dev_priv->perf.oa.poll_wq, wait);
+}
+
+static int i915_oa_read(struct i915_perf_stream *stream,
+			struct i915_perf_read_state *read_state)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	return dev_priv->perf.oa.ops.read(stream, read_state);
+}
+
+static void
+free_oa_buffer(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->perf.oa.oa_buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->perf.oa.oa_buffer.obj);
+	drm_gem_object_unreference(&i915->perf.oa.oa_buffer.obj->base);
+
+	i915->perf.oa.oa_buffer.obj = NULL;
+	i915->perf.oa.oa_buffer.gtt_offset = 0;
+	i915->perf.oa.oa_buffer.addr = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
+}
+
+static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	BUG_ON(stream != dev_priv->perf.oa.exclusive_stream);
+
+	dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
+
+	free_oa_buffer(dev_priv);
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	intel_runtime_pm_put(dev_priv);
+
+	dev_priv->perf.oa.exclusive_stream = NULL;
+}
+
+static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
+{
+	int i;
+	void *addr = NULL;
+	struct sg_page_iter sg_iter;
+	struct page **pages;
+
+	pages = drm_malloc_ab(obj->base.size >> PAGE_SHIFT, sizeof(*pages));
+	if (pages == NULL) {
+		DRM_DEBUG_DRIVER("Failed to get space for pages\n");
+		goto finish;
+	}
+
+	i = 0;
+	for_each_sg_page(obj->pages->sgl, &sg_iter, obj->pages->nents, 0) {
+		pages[i] = sg_page_iter_page(&sg_iter);
+		i++;
+	}
+
+	addr = vmap(pages, i, 0, PAGE_KERNEL);
+	if (addr == NULL) {
+		DRM_DEBUG_DRIVER("Failed to vmap pages\n");
+		goto finish;
+	}
+
+finish:
+	if (pages)
+		drm_free_large(pages);
+	return addr;
+}
+
+static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
+{
+	/* Pre-DevBDW: OABUFFER must be set with counters off,
+	 * before OASTATUS1, but after OASTATUS2
+	 */
+	I915_WRITE(GEN7_OASTATUS2, dev_priv->perf.oa.oa_buffer.gtt_offset |
+		   OA_MEM_SELECT_GGTT); /* head */
+	I915_WRITE(GEN7_OABUFFER, dev_priv->perf.oa.oa_buffer.gtt_offset);
+	I915_WRITE(GEN7_OASTATUS1, dev_priv->perf.oa.oa_buffer.gtt_offset |
+		   OABUFFER_SIZE_16M); /* tail */
+
+	/* On Haswell we have to track which OASTATUS1 flags we've
+	 * already seen since they can't be cleared while periodic
+	 * sampling is enabled.
+	 */
+	dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
+
+	/* We have a sanity check in gen7_append_oa_reports() that
+	 * looks at the report-id field to make sure it's non-zero
+	 * which relies on the assumption that new reports are
+	 * being written to zeroed memory...
+	 */
+	memset(dev_priv->perf.oa.oa_buffer.addr, 0, SZ_16M);
+}
+
+static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return ret;
+
+	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE);
+	if (bo == NULL) {
+		DRM_ERROR("Failed to allocate OA buffer\n");
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	dev_priv->perf.oa.oa_buffer.obj = bo;
+
+	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
+	if (ret)
+		goto err_unref;
+
+	/* PreHSW required 512K alignment, HSW requires 16M */
+	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
+	if (ret)
+		goto err_unref;
+
+	dev_priv->perf.oa.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
+	dev_priv->perf.oa.oa_buffer.addr = vmap_oa_buffer(bo);
+
+	dev_priv->perf.oa.ops.init_oa_buffer(dev_priv);
+
+	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p\n",
+			 dev_priv->perf.oa.oa_buffer.gtt_offset,
+			 dev_priv->perf.oa.oa_buffer.addr);
+
+	goto unlock;
+
+err_unref:
+	drm_gem_object_unreference(&bo->base);
+
+unlock:
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	return ret;
+}
+
+static void config_oa_regs(struct drm_i915_private *dev_priv,
+			   const struct i915_oa_reg *regs,
+			   int n_regs)
+{
+	int i;
+
+	for (i = 0; i < n_regs; i++) {
+		const struct i915_oa_reg *reg = regs + i;
+
+		I915_WRITE(reg->addr, reg->value);
+	}
+}
+
+static int hsw_enable_metric_set(struct drm_i915_private *dev_priv)
+{
+	int ret = i915_oa_select_metric_set_hsw(dev_priv);
+
+	if (ret)
+		return ret;
+
+	I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
+
+	/* PRM:
+	 *
+	 * OA unit is using “crclk” for its functionality. When trunk
+	 * level clock gating takes place, OA clock would be gated,
+	 * unable to count the events from non-render clock domain.
+	 * Render clock gating must be disabled when OA is enabled to
+	 * count the events from non-render domain. Unit level clock
+	 * gating for RCS should also be disabled.
+	 */
+	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
+				    ~GEN7_DOP_CLOCK_GATE_ENABLE));
+	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) |
+				  GEN6_CSUNIT_CLOCK_GATE_DISABLE));
+
+	config_oa_regs(dev_priv, dev_priv->perf.oa.mux_regs,
+		       dev_priv->perf.oa.mux_regs_len);
+
+	/* It takes a fairly long time for a new MUX configuration to
+	 * be applied after these register writes. This delay
+	 * duration was derived empirically based on the render_basic
+	 * config but hopefully it covers the maximum configuration
+	 * latency...
+	 */
+	mdelay(100);
+
+	config_oa_regs(dev_priv, dev_priv->perf.oa.b_counter_regs,
+		       dev_priv->perf.oa.b_counter_regs_len);
+
+	return 0;
+}
+
+static void hsw_disable_metric_set(struct drm_i915_private *dev_priv)
+{
+	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) &
+				  ~GEN6_CSUNIT_CLOCK_GATE_DISABLE));
+	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) |
+				    GEN7_DOP_CLOCK_GATE_ENABLE));
+
+	I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
+				      ~GT_NOA_ENABLE));
+}
+
+static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
+{
+	assert_spin_locked(&dev_priv->perf.hook_lock);
+
+	if (dev_priv->perf.oa.exclusive_stream->enabled) {
+		unsigned long ctx_id = 0;
+
+		if (dev_priv->perf.oa.exclusive_stream->ctx)
+			ctx_id = dev_priv->perf.oa.specific_ctx_id;
+
+		if (dev_priv->perf.oa.exclusive_stream->ctx == NULL || ctx_id) {
+			bool periodic = dev_priv->perf.oa.periodic;
+			u32 period_exponent = dev_priv->perf.oa.period_exponent;
+			u32 report_format = dev_priv->perf.oa.oa_buffer.format;
+
+			I915_WRITE(GEN7_OACONTROL,
+				   (ctx_id & GEN7_OACONTROL_CTX_MASK) |
+				   (period_exponent <<
+				    GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
+				   (periodic ?
+				    GEN7_OACONTROL_TIMER_ENABLE : 0) |
+				   (report_format <<
+				    GEN7_OACONTROL_FORMAT_SHIFT) |
+				   (ctx_id ?
+				    GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
+				   GEN7_OACONTROL_ENABLE);
+			return;
+		}
+	}
+
+	I915_WRITE(GEN7_OACONTROL, 0);
+}
+
+static void gen7_oa_enable(struct drm_i915_private *dev_priv)
+{
+	unsigned long flags;
+
+	/* Reset buf pointers so we don't forward reports from before now. */
+	gen7_init_oa_buffer(dev_priv);
+
+	spin_lock_irqsave(&dev_priv->perf.hook_lock, flags);
+	gen7_update_oacontrol_locked(dev_priv);
+	spin_unlock_irqrestore(&dev_priv->perf.hook_lock, flags);
+}
+
+static void i915_oa_stream_enable(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	dev_priv->perf.oa.ops.oa_enable(dev_priv);
+
+	if (dev_priv->perf.oa.periodic)
+		hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
+			      ns_to_ktime(POLL_PERIOD),
+			      HRTIMER_MODE_REL_PINNED);
+}
+
+static void gen7_oa_disable(struct drm_i915_private *dev_priv)
+{
+	I915_WRITE(GEN7_OACONTROL, 0);
+}
+
+static void i915_oa_stream_disable(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	dev_priv->perf.oa.ops.oa_disable(dev_priv);
+
+	if (dev_priv->perf.oa.periodic)
+		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
+}
+
+static u64 oa_exponent_to_ns(struct drm_i915_private *dev_priv, int exponent)
+{
+	return 1000000000ULL * (2ULL << exponent) /
+		dev_priv->perf.oa.timestamp_frequency;
+}
+
+static int i915_oa_stream_init(struct i915_perf_stream *stream,
+			       struct drm_i915_perf_open_param *param,
+			       struct perf_open_properties *props)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int format_size;
+	int ret;
+
+	if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
+		DRM_ERROR("Only OA report sampling supported\n");
+		return -EINVAL;
+	}
+
+	if (!dev_priv->perf.oa.ops.init_oa_buffer) {
+		DRM_ERROR("OA unit not supported\n");
+		return -ENODEV;
+	}
+
+	/* To avoid the complexity of having to accurately filter
+	 * counter reports and marshal to the appropriate client
+	 * we currently only allow exclusive access
+	 */
+	if (dev_priv->perf.oa.exclusive_stream) {
+		DRM_ERROR("OA unit already in use\n");
+		return -EBUSY;
+	}
+
+	if (!props->metrics_set) {
+		DRM_ERROR("OA metric set not specified\n");
+		return -EINVAL;
+	}
+
+	if (!props->oa_format) {
+		DRM_ERROR("OA report format not specified\n");
+		return -EINVAL;
+	}
+
+	stream->sample_size = sizeof(struct drm_i915_perf_record_header);
+
+	format_size = dev_priv->perf.oa.oa_formats[props->oa_format].size;
+
+	stream->sample_flags |= SAMPLE_OA_REPORT;
+	stream->sample_size += format_size;
+
+	dev_priv->perf.oa.oa_buffer.format_size = format_size;
+	BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
+
+	dev_priv->perf.oa.oa_buffer.format =
+		dev_priv->perf.oa.oa_formats[props->oa_format].format;
+
+	dev_priv->perf.oa.metrics_set = props->metrics_set;
+
+	dev_priv->perf.oa.periodic = props->oa_periodic;
+	if (dev_priv->perf.oa.periodic) {
+		u64 period_ns = oa_exponent_to_ns(dev_priv,
+						  props->oa_period_exponent);
+
+		dev_priv->perf.oa.period_exponent = props->oa_period_exponent;
+
+		/* See comment for OA_TAIL_MARGIN_NSEC for details
+		 * about this tail_margin...
+		 */
+		dev_priv->perf.oa.tail_margin =
+			((OA_TAIL_MARGIN_NSEC / period_ns) + 1) * format_size;
+	}
+
+	ret = alloc_oa_buffer(dev_priv);
+	if (ret)
+		return ret;
+
+	/* PRM - observability performance counters:
+	 *
+	 *   OACONTROL, performance counter enable, note:
+	 *
+	 *   "When this bit is set, in order to have coherent counts,
+	 *   RC6 power state and trunk clock gating must be disabled.
+	 *   This can be achieved by programming MMIO registers as
+	 *   0xA094=0 and 0xA090[31]=1"
+	 *
+	 *   In our case we are expecting that taking pm + FORCEWAKE
+	 *   references will effectively disable RC6.
+	 */
+	intel_runtime_pm_get(dev_priv);
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	ret = dev_priv->perf.oa.ops.enable_metric_set(dev_priv);
+	if (ret) {
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+		intel_runtime_pm_put(dev_priv);
+		free_oa_buffer(dev_priv);
+		return ret;
+	}
+
+	stream->destroy = i915_oa_stream_destroy;
+	stream->enable = i915_oa_stream_enable;
+	stream->disable = i915_oa_stream_disable;
+	stream->can_read = i915_oa_can_read;
+	stream->wait_unlocked = i915_oa_wait_unlocked;
+	stream->poll_wait = i915_oa_poll_wait;
+	stream->read = i915_oa_read;
+
+	/* On Haswell we have to track which OASTATUS1 flags we've already
+	 * seen since they can't be cleared while periodic sampling is enabled.
+	 */
+	dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
+
+	dev_priv->perf.oa.exclusive_stream = stream;
+
+	return 0;
+}
+
+static void gen7_update_hw_ctx_id_locked(struct drm_i915_private *dev_priv,
+					 u32 ctx_id)
+{
+	assert_spin_locked(&dev_priv->perf.hook_lock);
+
+	dev_priv->perf.oa.specific_ctx_id = ctx_id;
+	gen7_update_oacontrol_locked(dev_priv);
+}
+
+static void i915_oa_context_pin_notify_locked(struct drm_i915_private *dev_priv,
+					      struct intel_context *context)
+{
+	assert_spin_locked(&dev_priv->perf.hook_lock);
+
+	if (i915.enable_execlists ||
+	    dev_priv->perf.oa.ops.update_hw_ctx_id_locked == NULL)
+		return;
+
+	if (dev_priv->perf.oa.exclusive_stream &&
+	    dev_priv->perf.oa.exclusive_stream->ctx == context) {
+		struct drm_i915_gem_object *obj =
+			context->legacy_hw_ctx.rcs_state;
+		u32 ctx_id = i915_gem_obj_ggtt_offset(obj);
+
+		dev_priv->perf.oa.ops.update_hw_ctx_id_locked(dev_priv, ctx_id);
+	}
+}
+
+void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
+				struct intel_context *context)
+{
+	unsigned long flags;
+
+	if (!dev_priv->perf.initialized)
+		return;
+
+	spin_lock_irqsave(&dev_priv->perf.hook_lock, flags);
+	i915_oa_context_pin_notify_locked(dev_priv, context);
+	spin_unlock_irqrestore(&dev_priv->perf.hook_lock, flags);
+}
+
 static ssize_t i915_perf_read_locked(struct i915_perf_stream *stream,
 				     struct file *file,
 				     char __user *buf,
@@ -42,7 +858,10 @@ static ssize_t i915_perf_read_locked(struct i915_perf_stream *stream,
 	struct i915_perf_read_state state = { count, 0, buf };
 	int ret = stream->read(stream, &state);
 
-	if ((ret == -ENOSPC || ret == -EFAULT) && state.read)
+	/* Squash any internal error status in any case where we've
+	 * successfully copied some data so the data isn't lost.
+	 */
+	if (state.read)
 		ret = 0;
 
 	if (ret)
@@ -60,6 +879,13 @@ static ssize_t i915_perf_read(struct file *file,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	ssize_t ret;
 
+	/* To ensure it's handled consistently we simply treat all reads of a
+	 * disabled stream as an error. In particular it might otherwise lead
+	 * to a deadlock for blocking file descriptors...
+	 */
+	if (!stream->enabled)
+		return -EIO;
+
 	if (!(file->f_flags & O_NONBLOCK)) {
 		/* There's the small chance of false positives from
 		 * stream->wait_unlocked.
@@ -87,6 +913,20 @@ static ssize_t i915_perf_read(struct file *file,
 	return ret;
 }
 
+static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(hrtimer, typeof(*dev_priv),
+			     perf.oa.poll_check_timer);
+
+	if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
+		wake_up(&dev_priv->perf.oa.poll_wq);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
+
+	return HRTIMER_RESTART;
+}
+
 static unsigned int i915_perf_poll_locked(struct i915_perf_stream *stream,
 					  struct file *file,
 					  poll_table *wait)
@@ -277,18 +1117,18 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 		goto err_ctx;
 	}
 
-	stream->sample_flags = props->sample_flags;
 	stream->dev_priv = dev_priv;
 	stream->ctx = specific_ctx;
 
-	/*
-	 * TODO: support sampling something
-	 *
-	 * For now this is as far as we can go.
+	ret = i915_oa_stream_init(stream, param, props);
+	if (ret)
+		goto err_alloc;
+
+	/* We avoid simply assigning stream->sample_flags = props->sample_flags
+	 * so that _stream_init can check the combination of sample flags more
+	 * thoroughly, but this is still the expected result at this point.
 	 */
-	DRM_ERROR("Unsupported i915 perf stream configuration\n");
-	ret = -EINVAL;
-	goto err_alloc;
+	BUG_ON(stream->sample_flags != props->sample_flags);
 
 	list_add(&stream->link, &dev_priv->perf.streams);
 
@@ -373,7 +1213,56 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 			props->single_context = 1;
 			props->ctx_handle = value;
 			break;
-
+		case DRM_I915_PERF_PROP_SAMPLE_OA:
+			props->sample_flags |= SAMPLE_OA_REPORT;
+			break;
+		case DRM_I915_PERF_PROP_OA_METRICS_SET:
+			if (value == 0 ||
+			    value > dev_priv->perf.oa.n_builtin_sets) {
+				DRM_ERROR("Unknown OA metric set ID\n");
+				return -EINVAL;
+			}
+			props->metrics_set = value;
+			break;
+		case DRM_I915_PERF_PROP_OA_FORMAT:
+			if (value == 0 || value >= I915_OA_FORMAT_MAX) {
+				DRM_ERROR("Invalid OA report format\n");
+				return -EINVAL;
+			}
+			if (!dev_priv->perf.oa.oa_formats[value].size) {
+				DRM_ERROR("Invalid OA report format\n");
+				return -EINVAL;
+			}
+			props->oa_format = value;
+			break;
+		case DRM_I915_PERF_PROP_OA_EXPONENT:
+			if (value > OA_EXPONENT_MAX) {
+				DRM_ERROR("OA timer exponent too high (> %u)\n",
+					  OA_EXPONENT_MAX);
+				return -EINVAL;
+			}
+
+			/* NB: The exponent represents a period as follows:
+			 *
+			 *   80ns * 2^(period_exponent + 1)
+			 *
+			 * Theoretically we can program the OA unit to sample
+			 * every 160ns but don't allow that by default unless
+			 * root.
+			 *
+			 * Referring to perf's
+			 * kernel.perf_event_max_sample_rate for a precedent
+			 * (100000 by default); with an OA exponent of 6 we get
+			 * a period of 10.240 microseconds, just under 100000Hz
+			 */
+			if (value < 6 && !capable(CAP_SYS_ADMIN)) {
+				DRM_ERROR("Sampling frequency too high without root privileges\n");
+				return -EACCES;
+			}
+
+			props->oa_periodic = true;
+			props->oa_period_exponent = value;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -424,8 +1313,37 @@ void i915_perf_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
 
+	if (!IS_HASWELL(dev))
+		return;
+
+	hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
+		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	dev_priv->perf.oa.poll_check_timer.function = oa_poll_check_timer_cb;
+	init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
+
 	INIT_LIST_HEAD(&dev_priv->perf.streams);
 	mutex_init(&dev_priv->perf.lock);
+	spin_lock_init(&dev_priv->perf.hook_lock);
+
+	dev_priv->perf.oa.ops.init_oa_buffer = gen7_init_oa_buffer;
+	dev_priv->perf.oa.ops.enable_metric_set = hsw_enable_metric_set;
+	dev_priv->perf.oa.ops.disable_metric_set = hsw_disable_metric_set;
+	dev_priv->perf.oa.ops.oa_enable = gen7_oa_enable;
+	dev_priv->perf.oa.ops.oa_disable = gen7_oa_disable;
+	dev_priv->perf.oa.ops.update_hw_ctx_id_locked =
+		gen7_update_hw_ctx_id_locked;
+	dev_priv->perf.oa.ops.read = gen7_oa_read;
+	dev_priv->perf.oa.ops.oa_buffer_is_empty =
+		gen7_oa_buffer_is_empty_fop_unlocked;
+
+	dev_priv->perf.oa.timestamp_frequency = 12500000;
+
+	dev_priv->perf.oa.oa_formats = hsw_oa_formats;
+
+	dev_priv->perf.oa.n_builtin_sets =
+		i915_oa_n_builtin_metric_sets_hsw;
+
+	dev_priv->perf.oa.oa_formats = hsw_oa_formats;
 
 	dev_priv->perf.initialized = true;
 }
@@ -437,7 +1355,7 @@ void i915_perf_fini(struct drm_device *dev)
 	if (!dev_priv->perf.initialized)
 		return;
 
-	/* Currently nothing to clean up */
+	dev_priv->perf.oa.ops.init_oa_buffer = NULL;
 
 	dev_priv->perf.initialized = false;
 }
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index de1e9a0..22bf23c 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -594,6 +594,343 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define HSW_CS_GPR_UDW(n)               _MMIO(0x2600 + (n) * 8 + 4)
 
 #define GEN7_OACONTROL _MMIO(0x2360)
+#define  GEN7_OACONTROL_CTX_MASK	    0xFFFFF000
+#define  GEN7_OACONTROL_TIMER_PERIOD_MASK   0x3F
+#define  GEN7_OACONTROL_TIMER_PERIOD_SHIFT  6
+#define  GEN7_OACONTROL_TIMER_ENABLE	    (1<<5)
+#define  GEN7_OACONTROL_FORMAT_A13	    (0<<2)
+#define  GEN7_OACONTROL_FORMAT_A29	    (1<<2)
+#define  GEN7_OACONTROL_FORMAT_A13_B8_C8    (2<<2)
+#define  GEN7_OACONTROL_FORMAT_A29_B8_C8    (3<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8	    (4<<2)
+#define  GEN7_OACONTROL_FORMAT_A45_B8_C8    (5<<2)
+#define  GEN7_OACONTROL_FORMAT_B4_C8_A16    (6<<2)
+#define  GEN7_OACONTROL_FORMAT_C4_B8	    (7<<2)
+#define  GEN7_OACONTROL_FORMAT_SHIFT	    2
+#define  GEN7_OACONTROL_PER_CTX_ENABLE	    (1<<1)
+#define  GEN7_OACONTROL_ENABLE		    (1<<0)
+
+#define GEN8_OACTXID _MMIO(0x2364)
+
+#define GEN8_OACONTROL _MMIO(0x2B00)
+#define  GEN8_OA_REPORT_FORMAT_A12	    (0<<2)
+#define  GEN8_OA_REPORT_FORMAT_A12_B8_C8    (2<<2)
+#define  GEN8_OA_REPORT_FORMAT_A36_B8_C8    (5<<2)
+#define  GEN8_OA_REPORT_FORMAT_C4_B8	    (7<<2)
+#define  GEN8_OA_REPORT_FORMAT_SHIFT	    2
+#define  GEN8_OA_SPECIFIC_CONTEXT_ENABLE    (1<<1)
+#define  GEN8_OA_COUNTER_ENABLE             (1<<0)
+
+#define GEN8_OACTXCONTROL _MMIO(0x2360)
+#define  GEN8_OA_TIMER_PERIOD_MASK	    0x3F
+#define  GEN8_OA_TIMER_PERIOD_SHIFT	    2
+#define  GEN8_OA_TIMER_ENABLE		    (1<<1)
+#define  GEN8_OA_COUNTER_RESUME		    (1<<0)
+
+#define GEN7_OABUFFER _MMIO(0x23B0) /* R/W */
+#define  GEN7_OABUFFER_OVERRUN_DISABLE	    (1<<3)
+#define  GEN7_OABUFFER_EDGE_TRIGGER	    (1<<2)
+#define  GEN7_OABUFFER_STOP_RESUME_ENABLE   (1<<1)
+#define  GEN7_OABUFFER_RESUME		    (1<<0)
+
+#define GEN8_OABUFFER _MMIO(0x2b14)
+
+#define GEN7_OASTATUS1 _MMIO(0x2364)
+#define  GEN7_OASTATUS1_TAIL_MASK	    0xffffffc0
+#define  GEN7_OASTATUS1_COUNTER_OVERFLOW    (1<<2)
+#define  GEN7_OASTATUS1_OABUFFER_OVERFLOW   (1<<1)
+#define  GEN7_OASTATUS1_REPORT_LOST	    (1<<0)
+
+#define GEN7_OASTATUS2 _MMIO(0x2368)
+#define GEN7_OASTATUS2_HEAD_MASK    0xffffffc0
+
+#define GEN8_OASTATUS _MMIO(0x2b08)
+#define  GEN8_OASTATUS_OVERRUN_STATUS	    (1<<3)
+#define  GEN8_OASTATUS_COUNTER_OVERFLOW     (1<<2)
+#define  GEN8_OASTATUS_OABUFFER_OVERFLOW    (1<<1)
+#define  GEN8_OASTATUS_REPORT_LOST	    (1<<0)
+
+#define GEN8_OAHEADPTR _MMIO(0x2B0C)
+#define GEN8_OATAILPTR _MMIO(0x2B10)
+
+#define OABUFFER_SIZE_128K  (0<<3)
+#define OABUFFER_SIZE_256K  (1<<3)
+#define OABUFFER_SIZE_512K  (2<<3)
+#define OABUFFER_SIZE_1M    (3<<3)
+#define OABUFFER_SIZE_2M    (4<<3)
+#define OABUFFER_SIZE_4M    (5<<3)
+#define OABUFFER_SIZE_8M    (6<<3)
+#define OABUFFER_SIZE_16M   (7<<3)
+
+#define OA_MEM_SELECT_GGTT  (1<<0)
+
+#define EU_PERF_CNTL0	    _MMIO(0xe458)
+
+#define GDT_CHICKEN_BITS    _MMIO(0x9840)
+#define GT_NOA_ENABLE	    0x00000080
+
+/*
+ * OA Boolean state
+ */
+
+#define OAREPORTTRIG1 _MMIO(0x2740)
+#define OAREPORTTRIG1_THRESHOLD_MASK 0xffff
+#define OAREPORTTRIG1_EDGE_LEVEL_TRIGER_SELECT_MASK 0xffff0000 /* 0=level */
+
+#define OAREPORTTRIG2 _MMIO(0x2744)
+#define OAREPORTTRIG2_INVERT_A_0  (1<<0)
+#define OAREPORTTRIG2_INVERT_A_1  (1<<1)
+#define OAREPORTTRIG2_INVERT_A_2  (1<<2)
+#define OAREPORTTRIG2_INVERT_A_3  (1<<3)
+#define OAREPORTTRIG2_INVERT_A_4  (1<<4)
+#define OAREPORTTRIG2_INVERT_A_5  (1<<5)
+#define OAREPORTTRIG2_INVERT_A_6  (1<<6)
+#define OAREPORTTRIG2_INVERT_A_7  (1<<7)
+#define OAREPORTTRIG2_INVERT_A_8  (1<<8)
+#define OAREPORTTRIG2_INVERT_A_9  (1<<9)
+#define OAREPORTTRIG2_INVERT_A_10 (1<<10)
+#define OAREPORTTRIG2_INVERT_A_11 (1<<11)
+#define OAREPORTTRIG2_INVERT_A_12 (1<<12)
+#define OAREPORTTRIG2_INVERT_A_13 (1<<13)
+#define OAREPORTTRIG2_INVERT_A_14 (1<<14)
+#define OAREPORTTRIG2_INVERT_A_15 (1<<15)
+#define OAREPORTTRIG2_INVERT_B_0  (1<<16)
+#define OAREPORTTRIG2_INVERT_B_1  (1<<17)
+#define OAREPORTTRIG2_INVERT_B_2  (1<<18)
+#define OAREPORTTRIG2_INVERT_B_3  (1<<19)
+#define OAREPORTTRIG2_INVERT_C_0  (1<<20)
+#define OAREPORTTRIG2_INVERT_C_1  (1<<21)
+#define OAREPORTTRIG2_INVERT_D_0  (1<<22)
+#define OAREPORTTRIG2_THRESHOLD_ENABLE	    (1<<23)
+#define OAREPORTTRIG2_REPORT_TRIGGER_ENABLE (1<<31)
+
+#define OAREPORTTRIG3 _MMIO(0x2748)
+#define OAREPORTTRIG3_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG3_NOA_SELECT_8_SHIFT    0
+#define OAREPORTTRIG3_NOA_SELECT_9_SHIFT    4
+#define OAREPORTTRIG3_NOA_SELECT_10_SHIFT   8
+#define OAREPORTTRIG3_NOA_SELECT_11_SHIFT   12
+#define OAREPORTTRIG3_NOA_SELECT_12_SHIFT   16
+#define OAREPORTTRIG3_NOA_SELECT_13_SHIFT   20
+#define OAREPORTTRIG3_NOA_SELECT_14_SHIFT   24
+#define OAREPORTTRIG3_NOA_SELECT_15_SHIFT   28
+
+#define OAREPORTTRIG4 _MMIO(0x274c)
+#define OAREPORTTRIG4_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG4_NOA_SELECT_0_SHIFT    0
+#define OAREPORTTRIG4_NOA_SELECT_1_SHIFT    4
+#define OAREPORTTRIG4_NOA_SELECT_2_SHIFT    8
+#define OAREPORTTRIG4_NOA_SELECT_3_SHIFT    12
+#define OAREPORTTRIG4_NOA_SELECT_4_SHIFT    16
+#define OAREPORTTRIG4_NOA_SELECT_5_SHIFT    20
+#define OAREPORTTRIG4_NOA_SELECT_6_SHIFT    24
+#define OAREPORTTRIG4_NOA_SELECT_7_SHIFT    28
+
+#define OAREPORTTRIG5 _MMIO(0x2750)
+#define OAREPORTTRIG5_THRESHOLD_MASK 0xffff
+#define OAREPORTTRIG5_EDGE_LEVEL_TRIGER_SELECT_MASK 0xffff0000 /* 0=level */
+
+#define OAREPORTTRIG6 _MMIO(0x2754)
+#define OAREPORTTRIG6_INVERT_A_0  (1<<0)
+#define OAREPORTTRIG6_INVERT_A_1  (1<<1)
+#define OAREPORTTRIG6_INVERT_A_2  (1<<2)
+#define OAREPORTTRIG6_INVERT_A_3  (1<<3)
+#define OAREPORTTRIG6_INVERT_A_4  (1<<4)
+#define OAREPORTTRIG6_INVERT_A_5  (1<<5)
+#define OAREPORTTRIG6_INVERT_A_6  (1<<6)
+#define OAREPORTTRIG6_INVERT_A_7  (1<<7)
+#define OAREPORTTRIG6_INVERT_A_8  (1<<8)
+#define OAREPORTTRIG6_INVERT_A_9  (1<<9)
+#define OAREPORTTRIG6_INVERT_A_10 (1<<10)
+#define OAREPORTTRIG6_INVERT_A_11 (1<<11)
+#define OAREPORTTRIG6_INVERT_A_12 (1<<12)
+#define OAREPORTTRIG6_INVERT_A_13 (1<<13)
+#define OAREPORTTRIG6_INVERT_A_14 (1<<14)
+#define OAREPORTTRIG6_INVERT_A_15 (1<<15)
+#define OAREPORTTRIG6_INVERT_B_0  (1<<16)
+#define OAREPORTTRIG6_INVERT_B_1  (1<<17)
+#define OAREPORTTRIG6_INVERT_B_2  (1<<18)
+#define OAREPORTTRIG6_INVERT_B_3  (1<<19)
+#define OAREPORTTRIG6_INVERT_C_0  (1<<20)
+#define OAREPORTTRIG6_INVERT_C_1  (1<<21)
+#define OAREPORTTRIG6_INVERT_D_0  (1<<22)
+#define OAREPORTTRIG6_THRESHOLD_ENABLE	    (1<<23)
+#define OAREPORTTRIG6_REPORT_TRIGGER_ENABLE (1<<31)
+
+#define OAREPORTTRIG7 _MMIO(0x2758)
+#define OAREPORTTRIG7_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG7_NOA_SELECT_8_SHIFT    0
+#define OAREPORTTRIG7_NOA_SELECT_9_SHIFT    4
+#define OAREPORTTRIG7_NOA_SELECT_10_SHIFT   8
+#define OAREPORTTRIG7_NOA_SELECT_11_SHIFT   12
+#define OAREPORTTRIG7_NOA_SELECT_12_SHIFT   16
+#define OAREPORTTRIG7_NOA_SELECT_13_SHIFT   20
+#define OAREPORTTRIG7_NOA_SELECT_14_SHIFT   24
+#define OAREPORTTRIG7_NOA_SELECT_15_SHIFT   28
+
+#define OAREPORTTRIG8 _MMIO(0x275c)
+#define OAREPORTTRIG8_NOA_SELECT_MASK	    0xf
+#define OAREPORTTRIG8_NOA_SELECT_0_SHIFT    0
+#define OAREPORTTRIG8_NOA_SELECT_1_SHIFT    4
+#define OAREPORTTRIG8_NOA_SELECT_2_SHIFT    8
+#define OAREPORTTRIG8_NOA_SELECT_3_SHIFT    12
+#define OAREPORTTRIG8_NOA_SELECT_4_SHIFT    16
+#define OAREPORTTRIG8_NOA_SELECT_5_SHIFT    20
+#define OAREPORTTRIG8_NOA_SELECT_6_SHIFT    24
+#define OAREPORTTRIG8_NOA_SELECT_7_SHIFT    28
+
+#define OASTARTTRIG1 _MMIO(0x2710)
+#define OASTARTTRIG1_THRESHOLD_COUNT_MASK_MBZ 0xffff0000
+#define OASTARTTRIG1_THRESHOLD_MASK	      0xffff
+
+#define OASTARTTRIG2 _MMIO(0x2714)
+#define OASTARTTRIG2_INVERT_A_0 (1<<0)
+#define OASTARTTRIG2_INVERT_A_1 (1<<1)
+#define OASTARTTRIG2_INVERT_A_2 (1<<2)
+#define OASTARTTRIG2_INVERT_A_3 (1<<3)
+#define OASTARTTRIG2_INVERT_A_4 (1<<4)
+#define OASTARTTRIG2_INVERT_A_5 (1<<5)
+#define OASTARTTRIG2_INVERT_A_6 (1<<6)
+#define OASTARTTRIG2_INVERT_A_7 (1<<7)
+#define OASTARTTRIG2_INVERT_A_8 (1<<8)
+#define OASTARTTRIG2_INVERT_A_9 (1<<9)
+#define OASTARTTRIG2_INVERT_A_10 (1<<10)
+#define OASTARTTRIG2_INVERT_A_11 (1<<11)
+#define OASTARTTRIG2_INVERT_A_12 (1<<12)
+#define OASTARTTRIG2_INVERT_A_13 (1<<13)
+#define OASTARTTRIG2_INVERT_A_14 (1<<14)
+#define OASTARTTRIG2_INVERT_A_15 (1<<15)
+#define OASTARTTRIG2_INVERT_B_0 (1<<16)
+#define OASTARTTRIG2_INVERT_B_1 (1<<17)
+#define OASTARTTRIG2_INVERT_B_2 (1<<18)
+#define OASTARTTRIG2_INVERT_B_3 (1<<19)
+#define OASTARTTRIG2_INVERT_C_0 (1<<20)
+#define OASTARTTRIG2_INVERT_C_1 (1<<21)
+#define OASTARTTRIG2_INVERT_D_0 (1<<22)
+#define OASTARTTRIG2_THRESHOLD_ENABLE	    (1<<23)
+#define OASTARTTRIG2_START_TRIG_FLAG_MBZ    (1<<24)
+#define OASTARTTRIG2_EVENT_SELECT_0  (1<<28)
+#define OASTARTTRIG2_EVENT_SELECT_1  (1<<29)
+#define OASTARTTRIG2_EVENT_SELECT_2  (1<<30)
+#define OASTARTTRIG2_EVENT_SELECT_3  (1<<31)
+
+#define OASTARTTRIG3 _MMIO(0x2718)
+#define OASTARTTRIG3_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG3_NOA_SELECT_8_SHIFT    0
+#define OASTARTTRIG3_NOA_SELECT_9_SHIFT    4
+#define OASTARTTRIG3_NOA_SELECT_10_SHIFT   8
+#define OASTARTTRIG3_NOA_SELECT_11_SHIFT   12
+#define OASTARTTRIG3_NOA_SELECT_12_SHIFT   16
+#define OASTARTTRIG3_NOA_SELECT_13_SHIFT   20
+#define OASTARTTRIG3_NOA_SELECT_14_SHIFT   24
+#define OASTARTTRIG3_NOA_SELECT_15_SHIFT   28
+
+#define OASTARTTRIG4 _MMIO(0x271c)
+#define OASTARTTRIG4_NOA_SELECT_MASK	    0xf
+#define OASTARTTRIG4_NOA_SELECT_0_SHIFT    0
+#define OASTARTTRIG4_NOA_SELECT_1_SHIFT    4
+#define OASTARTTRIG4_NOA_SELECT_2_SHIFT    8
+#define OASTARTTRIG4_NOA_SELECT_3_SHIFT    12
+#define OASTARTTRIG4_NOA_SELECT_4_SHIFT    16
+#define OASTARTTRIG4_NOA_SELECT_5_SHIFT    20
+#define OASTARTTRIG4_NOA_SELECT_6_SHIFT    24
+#define OASTARTTRIG4_NOA_SELECT_7_SHIFT    28
+
+#define OASTARTTRIG5 _MMIO(0x2720)
+#define OASTARTTRIG5_THRESHOLD_COUNT_MASK_MBZ 0xffff0000
+#define OASTARTTRIG5_THRESHOLD_MASK	      0xffff
+
+#define OASTARTTRIG6 _MMIO(0x2724)
+#define OASTARTTRIG6_INVERT_A_0 (1<<0)
+#define OASTARTTRIG6_INVERT_A_1 (1<<1)
+#define OASTARTTRIG6_INVERT_A_2 (1<<2)
+#define OASTARTTRIG6_INVERT_A_3 (1<<3)
+#define OASTARTTRIG6_INVERT_A_4 (1<<4)
+#define OASTARTTRIG6_INVERT_A_5 (1<<5)
+#define OASTARTTRIG6_INVERT_A_6 (1<<6)
+#define OASTARTTRIG6_INVERT_A_7 (1<<7)
+#define OASTARTTRIG6_INVERT_A_8 (1<<8)
+#define OASTARTTRIG6_INVERT_A_9 (1<<9)
+#define OASTARTTRIG6_INVERT_A_10 (1<<10)
+#define OASTARTTRIG6_INVERT_A_11 (1<<11)
+#define OASTARTTRIG6_INVERT_A_12 (1<<12)
+#define OASTARTTRIG6_INVERT_A_13 (1<<13)
+#define OASTARTTRIG6_INVERT_A_14 (1<<14)
+#define OASTARTTRIG6_INVERT_A_15 (1<<15)
+#define OASTARTTRIG6_INVERT_B_0 (1<<16)
+#define OASTARTTRIG6_INVERT_B_1 (1<<17)
+#define OASTARTTRIG6_INVERT_B_2 (1<<18)
+#define OASTARTTRIG6_INVERT_B_3 (1<<19)
+#define OASTARTTRIG6_INVERT_C_0 (1<<20)
+#define OASTARTTRIG6_INVERT_C_1 (1<<21)
+#define OASTARTTRIG6_INVERT_D_0 (1<<22)
+#define OASTARTTRIG6_THRESHOLD_ENABLE	    (1<<23)
+#define OASTARTTRIG6_START_TRIG_FLAG_MBZ    (1<<24)
+#define OASTARTTRIG6_EVENT_SELECT_4  (1<<28)
+#define OASTARTTRIG6_EVENT_SELECT_5  (1<<29)
+#define OASTARTTRIG6_EVENT_SELECT_6  (1<<30)
+#define OASTARTTRIG6_EVENT_SELECT_7  (1<<31)
+
+#define OASTARTTRIG7 _MMIO(0x2728)
+#define OASTARTTRIG7_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG7_NOA_SELECT_8_SHIFT    0
+#define OASTARTTRIG7_NOA_SELECT_9_SHIFT    4
+#define OASTARTTRIG7_NOA_SELECT_10_SHIFT   8
+#define OASTARTTRIG7_NOA_SELECT_11_SHIFT   12
+#define OASTARTTRIG7_NOA_SELECT_12_SHIFT   16
+#define OASTARTTRIG7_NOA_SELECT_13_SHIFT   20
+#define OASTARTTRIG7_NOA_SELECT_14_SHIFT   24
+#define OASTARTTRIG7_NOA_SELECT_15_SHIFT   28
+
+#define OASTARTTRIG8 _MMIO(0x272c)
+#define OASTARTTRIG8_NOA_SELECT_MASK	   0xf
+#define OASTARTTRIG8_NOA_SELECT_0_SHIFT    0
+#define OASTARTTRIG8_NOA_SELECT_1_SHIFT    4
+#define OASTARTTRIG8_NOA_SELECT_2_SHIFT    8
+#define OASTARTTRIG8_NOA_SELECT_3_SHIFT    12
+#define OASTARTTRIG8_NOA_SELECT_4_SHIFT    16
+#define OASTARTTRIG8_NOA_SELECT_5_SHIFT    20
+#define OASTARTTRIG8_NOA_SELECT_6_SHIFT    24
+#define OASTARTTRIG8_NOA_SELECT_7_SHIFT    28
+
+/* CECX_0 */
+#define OACEC_COMPARE_LESS_OR_EQUAL	6
+#define OACEC_COMPARE_NOT_EQUAL		5
+#define OACEC_COMPARE_LESS_THAN		4
+#define OACEC_COMPARE_GREATER_OR_EQUAL	3
+#define OACEC_COMPARE_EQUAL		2
+#define OACEC_COMPARE_GREATER_THAN	1
+#define OACEC_COMPARE_ANY_EQUAL		0
+
+#define OACEC_COMPARE_VALUE_MASK    0xffff
+#define OACEC_COMPARE_VALUE_SHIFT   3
+
+#define OACEC_SELECT_NOA	(0<<19)
+#define OACEC_SELECT_PREV	(1<<19)
+#define OACEC_SELECT_BOOLEAN	(2<<19)
+
+/* CECX_1 */
+#define OACEC_MASK_MASK		    0xffff
+#define OACEC_CONSIDERATIONS_MASK   0xffff
+#define OACEC_CONSIDERATIONS_SHIFT  16
+
+#define OACEC0_0 _MMIO(0x2770)
+#define OACEC0_1 _MMIO(0x2774)
+#define OACEC1_0 _MMIO(0x2778)
+#define OACEC1_1 _MMIO(0x277c)
+#define OACEC2_0 _MMIO(0x2780)
+#define OACEC2_1 _MMIO(0x2784)
+#define OACEC3_0 _MMIO(0x2788)
+#define OACEC3_1 _MMIO(0x278c)
+#define OACEC4_0 _MMIO(0x2790)
+#define OACEC4_1 _MMIO(0x2794)
+#define OACEC5_0 _MMIO(0x2798)
+#define OACEC5_1 _MMIO(0x279c)
+#define OACEC6_0 _MMIO(0x27a0)
+#define OACEC6_1 _MMIO(0x27a4)
+#define OACEC7_0 _MMIO(0x27a8)
+#define OACEC7_1 _MMIO(0x27ac)
+
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
@@ -6900,6 +7237,7 @@ enum skl_disp_power_wells {
 # define GEN6_RCCUNIT_CLOCK_GATE_DISABLE		(1 << 11)
 
 #define GEN6_UCGCTL3				_MMIO(0x9408)
+# define GEN6_OACSUNIT_CLOCK_GATE_DISABLE		(1 << 20)
 
 #define GEN7_UCGCTL4				_MMIO(0x940c)
 #define  GEN7_L3BANK2X_CLOCK_GATE_DISABLE	(1<<25)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 962cc96..d974b71 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1172,6 +1172,18 @@ struct drm_i915_gem_context_param {
 	__u64 value;
 };
 
+enum drm_i915_oa_format {
+	I915_OA_FORMAT_A13 = 1,
+	I915_OA_FORMAT_A29,
+	I915_OA_FORMAT_A13_B8_C8,
+	I915_OA_FORMAT_B4_C8,
+	I915_OA_FORMAT_A45_B8_C8,
+	I915_OA_FORMAT_B4_C8_A16,
+	I915_OA_FORMAT_C4_B8,
+
+	I915_OA_FORMAT_MAX	    /* non-ABI */
+};
+
 enum drm_i915_perf_property_id {
 	/**
 	 * Open the stream for a specific context handle (as used with
@@ -1180,6 +1192,32 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_CTX_HANDLE = 1,
 
+	/**
+	 * A value of 1 requests the inclusion of raw OA unit reports as
+	 * part of stream samples.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_OA,
+
+	/**
+	 * The value specifies which set of OA unit metrics should be
+	 * configured, defining the contents of any OA unit reports.
+	 */
+	DRM_I915_PERF_PROP_OA_METRICS_SET,
+
+	/**
+	 * The value specifies the size and layout of OA unit reports.
+	 */
+	DRM_I915_PERF_PROP_OA_FORMAT,
+
+	/**
+	 * Specifying this property implicitly requests periodic OA unit
+	 * sampling and (at least on Haswell) the sampling period is derived
+	 * from this exponent as follows:
+	 *
+	 *   80ns * 2^(period_exponent + 1)
+	 */
+	DRM_I915_PERF_PROP_OA_EXPONENT,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1199,7 +1237,22 @@ struct drm_i915_perf_open_param {
 	__u64 __user properties_ptr;
 };
 
+/**
+ * Enable data capture for a stream that was either opened in a disabled state
+ * via I915_PERF_FLAG_DISABLED or was later disabled via I915_PERF_IOCTL_DISABLE.
+ *
+ * It is intended to be cheaper to disable and enable a stream than it may be
+ * to close and re-open a stream with the same configuration.
+ *
+ * It's undefined whether any pending data for the stream will be lost.
+ */
 #define I915_PERF_IOCTL_ENABLE	_IO('i', 0x0)
+
+/**
+ * Disable data capture for a stream.
+ *
+ * It is an error to try to read a stream that is disabled.
+ */
 #define I915_PERF_IOCTL_DISABLE	_IO('i', 0x1)
 
 /**
@@ -1223,17 +1276,30 @@ enum drm_i915_perf_record_type {
 	 * every sample.
 	 *
 	 * The order of these sample properties given by userspace has no
-	 * affect on the ordering of data within a sample. The order will be
+	 * effect on the ordering of data within a sample. The order is
 	 * documented here.
 	 *
 	 * struct {
 	 *     struct drm_i915_perf_record_header header;
 	 *
-	 *     TODO: itemize extensible sample data here
+	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
 	DRM_I915_PERF_RECORD_SAMPLE = 1,
 
+	/**
+	 * Indicates that one or more OA reports were not written by the
+	 * hardware. This can happen for example if an MI_REPORT_PERF_COUNT
+	 * command collides with periodic sampling - which would be more likely
+	 * at higher sampling frequencies.
+	 */
+	DRM_I915_PERF_RECORD_OA_REPORT_LOST = 2,
+
+	/**
+	 * An error occurred that resulted in all pending OA reports being lost.
+	 */
+	DRM_I915_PERF_RECORD_OA_BUFFER_LOST = 3,
+
 	DRM_I915_PERF_RECORD_MAX /* non-ABI */
 };
 
-- 
2.7.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* [PATCH 6/9] drm/i915: advertise available metrics via sysfs
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (4 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 7/9] drm/i915: Add dev.i915.perf_event_paranoid sysctl option Robert Bragg
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

Each metric set is given a sysfs entry like:

/sys/class/drm/card0/metrics/<guid>/id

This allows userspace to enumerate the specific sets that are available
for the current system. The 'id' file contains an unsigned integer that
can be used to open the associated metric set via
DRM_IOCTL_I915_PERF_OPEN. The <guid> is a globally unique ID for a
specific OA unit configuration that can be reliably used as a key to
look up the corresponding counter metadata and normalization equations.
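
As a rough illustration of how userspace might consume one of these
entries, here is a minimal sketch. The function names are hypothetical
and not part of this series, and it assumes the device is card0 and that
the 'id' file holds a single unsigned decimal integer followed by a
newline, as the show callback in this patch prints it:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of an 'id' file ("<digits>\n").
 * Returns 0 on success, -1 on malformed input.
 */
int parse_metric_set_id(const char *text, unsigned long *id)
{
	char *end;

	/* strtoul() accepts leading whitespace and signs; reject those. */
	if (text[0] < '0' || text[0] > '9')
		return -1;

	errno = 0;
	*id = strtoul(text, &end, 10);
	if (errno || end == text)
		return -1;

	/* Allow only an optional trailing newline after the digits. */
	if (*end == '\n')
		end++;

	return *end == '\0' ? 0 : -1;
}

/* Hypothetical helper: read the id advertised for <guid> on card0. */
int read_metric_set_id(const char *guid, unsigned long *id)
{
	char path[256];
	char buf[32];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path),
		 "/sys/class/drm/card0/metrics/%s/id", guid);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fgets(buf, sizeof(buf), f))
		ret = parse_metric_set_id(buf, id);

	fclose(f);
	return ret;
}
```

The resulting id would then be passed as the value of the
DRM_I915_PERF_PROP_OA_METRICS_SET property when opening a stream.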

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h    |  2 ++
 drivers/gpu/drm/i915/i915_oa_hsw.c | 45 ++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_oa_hsw.h |  4 ++++
 drivers/gpu/drm/i915/i915_perf.c   | 18 ++++++++++++++-
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 972ae6c..1406b93 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2080,6 +2080,8 @@ struct drm_i915_private {
 	struct {
 		bool initialized;
 
+		struct kobject *metrics_kobj;
+
 		struct mutex lock;
 		struct list_head streams;
 
diff --git a/drivers/gpu/drm/i915/i915_oa_hsw.c b/drivers/gpu/drm/i915/i915_oa_hsw.c
index 5472aa0..3aa22eb 100644
--- a/drivers/gpu/drm/i915/i915_oa_hsw.c
+++ b/drivers/gpu/drm/i915/i915_oa_hsw.c
@@ -24,6 +24,8 @@
  *
  */
 
+#include <linux/sysfs.h>
+
 #include "i915_drv.h"
 
 enum metric_set_id {
@@ -130,3 +132,46 @@ int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv)
 		return -ENODEV;
 	}
 }
+
+static ssize_t
+show_render_basic_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_RENDER_BASIC);
+}
+
+static struct device_attribute dev_attr_render_basic_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_render_basic_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_render_basic[] = {
+	&dev_attr_render_basic_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_render_basic = {
+	.name = "403d8832-1a27-4aa6-a64e-f5389ce7b212",
+	.attrs =  attrs_render_basic,
+};
+
+int
+i915_perf_init_sysfs_hsw(struct drm_i915_private *dev_priv)
+{
+	int ret;
+
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_render_basic);
+	if (ret)
+		goto error_render_basic;
+
+	return 0;
+
+error_render_basic:
+	return ret;
+}
+
+void
+i915_perf_deinit_sysfs_hsw(struct drm_i915_private *dev_priv)
+{
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_render_basic);
+}
diff --git a/drivers/gpu/drm/i915/i915_oa_hsw.h b/drivers/gpu/drm/i915/i915_oa_hsw.h
index b618a1f..e4ba89d 100644
--- a/drivers/gpu/drm/i915/i915_oa_hsw.h
+++ b/drivers/gpu/drm/i915/i915_oa_hsw.h
@@ -31,4 +31,8 @@ extern int i915_oa_n_builtin_metric_sets_hsw;
 
 extern int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv);
 
+extern int i915_perf_init_sysfs_hsw(struct drm_i915_private *dev_priv);
+
+extern void i915_perf_deinit_sysfs_hsw(struct drm_i915_private *dev_priv);
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 5e58520..f2db3de 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1316,6 +1316,11 @@ void i915_perf_init(struct drm_device *dev)
 	if (!IS_HASWELL(dev))
 		return;
 
+	dev_priv->perf.metrics_kobj =
+		kobject_create_and_add("metrics", &dev->primary->kdev->kobj);
+	if (!dev_priv->perf.metrics_kobj)
+		return;
+
 	hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
 		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	dev_priv->perf.oa.poll_check_timer.function = oa_poll_check_timer_cb;
@@ -1343,9 +1348,15 @@ void i915_perf_init(struct drm_device *dev)
 	dev_priv->perf.oa.n_builtin_sets =
 		i915_oa_n_builtin_metric_sets_hsw;
 
-	dev_priv->perf.oa.oa_formats = hsw_oa_formats;
+	if (i915_perf_init_sysfs_hsw(dev_priv)) {
+		kobject_put(dev_priv->perf.metrics_kobj);
+		dev_priv->perf.metrics_kobj = NULL;
+		return;
+	}
 
 	dev_priv->perf.initialized = true;
+
+	return;
 }
 
 void i915_perf_fini(struct drm_device *dev)
@@ -1355,6 +1366,11 @@ void i915_perf_fini(struct drm_device *dev)
 	if (!dev_priv->perf.initialized)
 		return;
 
+	i915_perf_deinit_sysfs_hsw(dev_priv);
+
+	kobject_put(dev_priv->perf.metrics_kobj);
+	dev_priv->perf.metrics_kobj = NULL;
+
 	dev_priv->perf.oa.ops.init_oa_buffer = NULL;
 
 	dev_priv->perf.initialized = false;
-- 
2.7.1

* [PATCH 7/9] drm/i915: Add dev.i915.perf_event_paranoid sysctl option
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (5 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 6/9] drm/i915: advertise available metrics via sysfs Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 8/9] drm/i915: add oa_event_min_timer_exponent sysctl Robert Bragg
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

Consistent with the kernel.perf_event_paranoid sysctl option that can
allow non-root users to access system-wide CPU metrics, this can
optionally allow non-root users to access system-wide OA counter metrics
from Gen graphics hardware.
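
For reference, the resulting open-time check boils down to the predicate
sketched below. This is a standalone restatement of the condition in the
diff, not driver code: the function name is hypothetical and the three
booleans stand in for specific_ctx, the sysctl value and
capable(CAP_SYS_ADMIN) respectively.

```c
#include <stdbool.h>

/* Hypothetical restatement of the privilege check: returns true when
 * opening the stream is permitted.
 */
bool may_open_stream(bool specific_ctx, bool paranoid, bool is_admin)
{
	/* Streams bound to a single context are not gated by this check. */
	if (specific_ctx)
		return true;

	/* System-wide streams need CAP_SYS_ADMIN unless paranoid is 0. */
	return !paranoid || is_admin;
}
```

With the default of paranoid=1 the behaviour is unchanged from before
this patch: only a system-wide open without CAP_SYS_ADMIN is refused.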

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h  |  1 +
 drivers/gpu/drm/i915/i915_perf.c | 46 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1406b93..2ac32b2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2081,6 +2081,7 @@ struct drm_i915_private {
 		bool initialized;
 
 		struct kobject *metrics_kobj;
+		struct ctl_table_header *sysctl_header;
 
 		struct mutex lock;
 		struct list_head streams;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index f2db3de..c2ba16a 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -59,6 +59,8 @@
 #define POLL_FREQUENCY 200
 #define POLL_PERIOD (NSEC_PER_SEC / POLL_FREQUENCY)
 
+static u32 i915_perf_stream_paranoid = true;
+
 /* The maximum exponent the hardware accepts is 63 (essentially it selects one
  * of the 64bit timestamp bits to trigger reports from) but there's currently
  * no known use case for sampling as infrequently as once per 47 thousand years.
@@ -1105,7 +1107,13 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 		}
 	}
 
-	if (!specific_ctx && !capable(CAP_SYS_ADMIN)) {
+	/* Similar to perf's kernel.perf_event_paranoid sysctl option
+	 * we check a dev.i915.perf_stream_paranoid sysctl option
+	 * to determine if it's ok to access system wide OA counters
+	 * without CAP_SYS_ADMIN privileges.
+	 */
+	if (!specific_ctx &&
+	    i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
 		DRM_ERROR("Insufficient privileges to open system-wide i915 perf stream\n");
 		ret = -EACCES;
 		goto err_ctx;
@@ -1309,6 +1317,38 @@ int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 	return ret;
 }
 
+
+static struct ctl_table oa_table[] = {
+	{
+	 .procname = "perf_stream_paranoid",
+	 .data = &i915_perf_stream_paranoid,
+	 .maxlen = sizeof(i915_perf_stream_paranoid),
+	 .mode = 0644,
+	 .proc_handler = proc_dointvec,
+	 },
+	{}
+};
+
+static struct ctl_table i915_root[] = {
+	{
+	 .procname = "i915",
+	 .maxlen = 0,
+	 .mode = 0555,
+	 .child = oa_table,
+	 },
+	{}
+};
+
+static struct ctl_table dev_root[] = {
+	{
+	 .procname = "dev",
+	 .maxlen = 0,
+	 .mode = 0555,
+	 .child = i915_root,
+	 },
+	{}
+};
+
 void i915_perf_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
@@ -1354,6 +1394,8 @@ void i915_perf_init(struct drm_device *dev)
 		return;
 	}
 
+	dev_priv->perf.sysctl_header = register_sysctl_table(dev_root);
+
 	dev_priv->perf.initialized = true;
 
 	return;
@@ -1366,6 +1408,8 @@ void i915_perf_fini(struct drm_device *dev)
 	if (!dev_priv->perf.initialized)
 		return;
 
+	unregister_sysctl_table(dev_priv->perf.sysctl_header);
+
 	i915_perf_deinit_sysfs_hsw(dev_priv);
 
 	kobject_put(dev_priv->perf.metrics_kobj);
-- 
2.7.1

* [PATCH 8/9] drm/i915: add oa_event_min_timer_exponent sysctl
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (6 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 7/9] drm/i915: Add dev.i915.perf_event_paranoid sysctl option Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:23 ` [PATCH 9/9] drm/i915: Add more Haswell OA metric sets Robert Bragg
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

The minimal sampling period is now configurable via a
dev.i915.oa_min_timer_exponent sysctl parameter.

Following the precedent set by perf, the default is the smallest
exponent that won't (on its own) exceed the
kernel.perf_event_max_sample_rate default of 100000 samples/s.
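
To make the arithmetic behind that default concrete, here is a small
standalone sketch (helper names are hypothetical, not part of the
driver) that applies the Haswell formula period = 80ns * 2^(exponent + 1)
and searches for the smallest exponent whose rate stays at or below a
given limit:

```c
#include <stdint.h>

#define OA_EXPONENT_MAX 31

/* Sampling period in nanoseconds for a given OA timer exponent. */
uint64_t oa_period_ns(unsigned int exponent)
{
	return 80ULL << (exponent + 1);
}

/* Smallest exponent whose sampling rate does not exceed max_rate_hz,
 * or -1 if no exponent up to OA_EXPONENT_MAX qualifies.
 */
int oa_min_exponent_for_rate(uint64_t max_rate_hz)
{
	unsigned int e;

	for (e = 0; e <= OA_EXPONENT_MAX; e++) {
		uint64_t period = oa_period_ns(e);
		/* Rate in Hz, rounded up so the comparison is exact. */
		uint64_t rate_hz = (1000000000ULL + period - 1) / period;

		if (rate_hz <= max_rate_hz)
			return (int)e;
	}

	return -1;
}
```

An exponent of 5 gives a 5.12us period (roughly 195kHz), while 6 gives
10.24us (roughly 97.7kHz), which is why 6 is the first value under the
100000 samples/s limit.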

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_perf.c | 42 ++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index c2ba16a..cc1cd52 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -71,6 +71,23 @@ static u32 i915_perf_stream_paranoid = true;
  */
 #define OA_EXPONENT_MAX 31
 
+/* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
+static int zero;
+static int oa_exponent_max = OA_EXPONENT_MAX;
+
+/* Theoretically we can program the OA unit to sample every 160ns but don't
+ * allow that by default unless root...
+ *
+ * The period is derived from the exponent as:
+ *
+ *   period = 80ns * 2^(exponent + 1)
+ *
+ * Referring to perf's kernel.perf_event_max_sample_rate for a precedent
+ * (100000 by default); with an OA exponent of 6 we get a period of 10.240
+ * microseconds - just under 100000Hz
+ */
+static u32 i915_oa_min_timer_exponent = 6;
+
 /* XXX: beware if future OA HW adds new report formats that the current
  * code assumes all reports have a power-of-two size and ~(size - 1) can
  * be used as a mask to align the OA tail pointer.
@@ -1250,21 +1267,13 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 				return -EINVAL;
 			}
 
-			/* NB: The exponent represents a period as follows:
-			 *
-			 *   80ns * 2^(period_exponent + 1)
-			 *
-			 * Theoretically we can program the OA unit to sample
+			/* Theoretically we can program the OA unit to sample
 			 * every 160ns but don't allow that by default unless
 			 * root.
-			 *
-			 * Referring to perf's
-			 * kernel.perf_event_max_sample_rate for a precedent
-			 * (100000 by default); with an OA exponent of 6 we get
-			 * a period of 10.240 microseconds -just under 100000Hz
 			 */
-			if (value < 6 && !capable(CAP_SYS_ADMIN)) {
-				DRM_ERROR("Sampling period too high without root privileges\n");
+			if (value < i915_oa_min_timer_exponent &&
+			    !capable(CAP_SYS_ADMIN)) {
+				DRM_ERROR("OA timer exponent too low without root privileges\n");
 				return -EACCES;
 			}
 
@@ -1326,6 +1335,15 @@ static struct ctl_table oa_table[] = {
 	 .mode = 0644,
 	 .proc_handler = proc_dointvec,
 	 },
+	{
+	 .procname = "oa_min_timer_exponent",
+	 .data = &i915_oa_min_timer_exponent,
+	 .maxlen = sizeof(i915_oa_min_timer_exponent),
+	 .mode = 0644,
+	 .proc_handler = proc_dointvec_minmax,
+	 .extra1 = &zero,
+	 .extra2 = &oa_exponent_max,
+	 },
 	{}
 };
 
-- 
2.7.1

* [PATCH 9/9] drm/i915: Add more Haswell OA metric sets
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (7 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 8/9] drm/i915: add oa_event_min_timer_exponent sysctl Robert Bragg
@ 2016-04-20 14:23 ` Robert Bragg
  2016-04-20 14:56 ` [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:23 UTC (permalink / raw)
  To: intel-gfx; +Cc: David Airlie, dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

This adds 'compute', 'compute extended', 'memory reads', 'memory writes'
and 'sampler balance' metric sets for Haswell.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_oa_hsw.c | 483 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 482 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_oa_hsw.c b/drivers/gpu/drm/i915/i915_oa_hsw.c
index 3aa22eb..4a2de8a 100644
--- a/drivers/gpu/drm/i915/i915_oa_hsw.c
+++ b/drivers/gpu/drm/i915/i915_oa_hsw.c
@@ -30,9 +30,14 @@
 
 enum metric_set_id {
 	METRIC_SET_ID_RENDER_BASIC = 1,
+	METRIC_SET_ID_COMPUTE_BASIC,
+	METRIC_SET_ID_COMPUTE_EXTENDED,
+	METRIC_SET_ID_MEMORY_READS,
+	METRIC_SET_ID_MEMORY_WRITES,
+	METRIC_SET_ID_SAMPLER_BALANCE,
 };
 
-int i915_oa_n_builtin_metric_sets_hsw = 1;
+int i915_oa_n_builtin_metric_sets_hsw = 6;
 
 static const struct i915_oa_reg b_counter_config_render_basic[] = {
 	{ _MMIO(0x2724), 0x00800000 },
@@ -118,6 +123,332 @@ static int select_render_basic_config(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
+static const struct i915_oa_reg b_counter_config_compute_basic[] = {
+	{ _MMIO(0x2710), 0x00000000 },
+	{ _MMIO(0x2714), 0x00800000 },
+	{ _MMIO(0x2718), 0xAAAAAAAA },
+	{ _MMIO(0x271C), 0xAAAAAAAA },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2724), 0x00800000 },
+	{ _MMIO(0x2728), 0xAAAAAAAA },
+	{ _MMIO(0x272C), 0xAAAAAAAA },
+	{ _MMIO(0x2740), 0x00000000 },
+	{ _MMIO(0x2744), 0x00000000 },
+	{ _MMIO(0x2748), 0x00000000 },
+	{ _MMIO(0x274C), 0x00000000 },
+	{ _MMIO(0x2750), 0x00000000 },
+	{ _MMIO(0x2754), 0x00000000 },
+	{ _MMIO(0x2758), 0x00000000 },
+	{ _MMIO(0x275C), 0x00000000 },
+};
+
+static const struct i915_oa_reg mux_config_compute_basic[] = {
+	{ _MMIO(0x253A4), 0x00000000 },
+	{ _MMIO(0x2681C), 0x01F00800 },
+	{ _MMIO(0x26820), 0x00001000 },
+	{ _MMIO(0x2781C), 0x01F00800 },
+	{ _MMIO(0x26520), 0x00000007 },
+	{ _MMIO(0x265A0), 0x00000007 },
+	{ _MMIO(0x25380), 0x00000010 },
+	{ _MMIO(0x2538C), 0x00300000 },
+	{ _MMIO(0x25384), 0xAA8AAAAA },
+	{ _MMIO(0x25404), 0xFFFFFFFF },
+	{ _MMIO(0x26800), 0x00004202 },
+	{ _MMIO(0x26808), 0x00605817 },
+	{ _MMIO(0x2680C), 0x10001005 },
+	{ _MMIO(0x26804), 0x00000000 },
+	{ _MMIO(0x27800), 0x00000102 },
+	{ _MMIO(0x27808), 0x0C0701E0 },
+	{ _MMIO(0x2780C), 0x000200A0 },
+	{ _MMIO(0x27804), 0x00000000 },
+	{ _MMIO(0x26484), 0x44000000 },
+	{ _MMIO(0x26704), 0x44000000 },
+	{ _MMIO(0x26500), 0x00000006 },
+	{ _MMIO(0x26510), 0x00000001 },
+	{ _MMIO(0x26504), 0x88000000 },
+	{ _MMIO(0x26580), 0x00000006 },
+	{ _MMIO(0x26590), 0x00000020 },
+	{ _MMIO(0x26584), 0x00000000 },
+	{ _MMIO(0x26104), 0x55822222 },
+	{ _MMIO(0x26184), 0xAA866666 },
+	{ _MMIO(0x25420), 0x08320C83 },
+	{ _MMIO(0x25424), 0x06820C83 },
+	{ _MMIO(0x2541C), 0x00000000 },
+	{ _MMIO(0x25428), 0x00000C03 },
+};
+
+static int select_compute_basic_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_compute_basic;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_compute_basic);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_compute_basic;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_compute_basic);
+
+	return 0;
+}
+
+static const struct i915_oa_reg b_counter_config_compute_extended[] = {
+	{ _MMIO(0x2724), 0xf0800000 },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2714), 0xf0800000 },
+	{ _MMIO(0x2710), 0x00000000 },
+	{ _MMIO(0x2770), 0x0007fe2a },
+	{ _MMIO(0x2774), 0x0000ff00 },
+	{ _MMIO(0x2778), 0x0007fe6a },
+	{ _MMIO(0x277c), 0x0000ff00 },
+	{ _MMIO(0x2780), 0x0007fe92 },
+	{ _MMIO(0x2784), 0x0000ff00 },
+	{ _MMIO(0x2788), 0x0007fea2 },
+	{ _MMIO(0x278c), 0x0000ff00 },
+	{ _MMIO(0x2790), 0x0007fe32 },
+	{ _MMIO(0x2794), 0x0000ff00 },
+	{ _MMIO(0x2798), 0x0007fe9a },
+	{ _MMIO(0x279c), 0x0000ff00 },
+	{ _MMIO(0x27a0), 0x0007ff23 },
+	{ _MMIO(0x27a4), 0x0000ff00 },
+	{ _MMIO(0x27a8), 0x0007fff3 },
+	{ _MMIO(0x27ac), 0x0000fffe },
+};
+
+static const struct i915_oa_reg mux_config_compute_extended[] = {
+	{ _MMIO(0x2681C), 0x3EB00800 },
+	{ _MMIO(0x26820), 0x00900000 },
+	{ _MMIO(0x25384), 0x02AAAAAA },
+	{ _MMIO(0x25404), 0x03FFFFFF },
+	{ _MMIO(0x26800), 0x00142284 },
+	{ _MMIO(0x26808), 0x0E629062 },
+	{ _MMIO(0x2680C), 0x3F6F55CB },
+	{ _MMIO(0x26810), 0x00000014 },
+	{ _MMIO(0x26804), 0x00000000 },
+	{ _MMIO(0x26104), 0x02AAAAAA },
+	{ _MMIO(0x26184), 0x02AAAAAA },
+	{ _MMIO(0x25420), 0x00000000 },
+	{ _MMIO(0x25424), 0x00000000 },
+	{ _MMIO(0x2541C), 0x00000000 },
+	{ _MMIO(0x25428), 0x00000000 },
+};
+
+static int select_compute_extended_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_compute_extended;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_compute_extended);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_compute_extended;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_compute_extended);
+
+	return 0;
+}
+
+static const struct i915_oa_reg b_counter_config_memory_reads[] = {
+	{ _MMIO(0x2724), 0xf0800000 },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2714), 0xf0800000 },
+	{ _MMIO(0x2710), 0x00000000 },
+	{ _MMIO(0x274c), 0x76543298 },
+	{ _MMIO(0x2748), 0x98989898 },
+	{ _MMIO(0x2744), 0x000000e4 },
+	{ _MMIO(0x2740), 0x00000000 },
+	{ _MMIO(0x275c), 0x98a98a98 },
+	{ _MMIO(0x2758), 0x88888888 },
+	{ _MMIO(0x2754), 0x000c5500 },
+	{ _MMIO(0x2750), 0x00000000 },
+	{ _MMIO(0x2770), 0x0007f81a },
+	{ _MMIO(0x2774), 0x0000fc00 },
+	{ _MMIO(0x2778), 0x0007f82a },
+	{ _MMIO(0x277c), 0x0000fc00 },
+	{ _MMIO(0x2780), 0x0007f872 },
+	{ _MMIO(0x2784), 0x0000fc00 },
+	{ _MMIO(0x2788), 0x0007f8ba },
+	{ _MMIO(0x278c), 0x0000fc00 },
+	{ _MMIO(0x2790), 0x0007f87a },
+	{ _MMIO(0x2794), 0x0000fc00 },
+	{ _MMIO(0x2798), 0x0007f8ea },
+	{ _MMIO(0x279c), 0x0000fc00 },
+	{ _MMIO(0x27a0), 0x0007f8e2 },
+	{ _MMIO(0x27a4), 0x0000fc00 },
+	{ _MMIO(0x27a8), 0x0007f8f2 },
+	{ _MMIO(0x27ac), 0x0000fc00 },
+};
+
+static const struct i915_oa_reg mux_config_memory_reads[] = {
+	{ _MMIO(0x253A4), 0x34300000 },
+	{ _MMIO(0x25440), 0x2D800000 },
+	{ _MMIO(0x25444), 0x00000008 },
+	{ _MMIO(0x25128), 0x0E600000 },
+	{ _MMIO(0x25380), 0x00000450 },
+	{ _MMIO(0x25390), 0x00052C43 },
+	{ _MMIO(0x25384), 0x00000000 },
+	{ _MMIO(0x25400), 0x00006144 },
+	{ _MMIO(0x25408), 0x0A418820 },
+	{ _MMIO(0x2540C), 0x000820E6 },
+	{ _MMIO(0x25404), 0xFF500000 },
+	{ _MMIO(0x25100), 0x000005D6 },
+	{ _MMIO(0x2510C), 0x0EF00000 },
+	{ _MMIO(0x25104), 0x00000000 },
+	{ _MMIO(0x25420), 0x02108421 },
+	{ _MMIO(0x25424), 0x00008421 },
+	{ _MMIO(0x2541C), 0x00000000 },
+	{ _MMIO(0x25428), 0x00000000 },
+};
+
+static int select_memory_reads_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_memory_reads;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_memory_reads);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_memory_reads;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_memory_reads);
+
+	return 0;
+}
+
+static const struct i915_oa_reg b_counter_config_memory_writes[] = {
+	{ _MMIO(0x2724), 0xf0800000 },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2714), 0xf0800000 },
+	{ _MMIO(0x2710), 0x00000000 },
+	{ _MMIO(0x274c), 0x76543298 },
+	{ _MMIO(0x2748), 0x98989898 },
+	{ _MMIO(0x2744), 0x000000e4 },
+	{ _MMIO(0x2740), 0x00000000 },
+	{ _MMIO(0x275c), 0xbabababa },
+	{ _MMIO(0x2758), 0x88888888 },
+	{ _MMIO(0x2754), 0x000c5500 },
+	{ _MMIO(0x2750), 0x00000000 },
+	{ _MMIO(0x2770), 0x0007f81a },
+	{ _MMIO(0x2774), 0x0000fc00 },
+	{ _MMIO(0x2778), 0x0007f82a },
+	{ _MMIO(0x277c), 0x0000fc00 },
+	{ _MMIO(0x2780), 0x0007f822 },
+	{ _MMIO(0x2784), 0x0000fc00 },
+	{ _MMIO(0x2788), 0x0007f8ba },
+	{ _MMIO(0x278c), 0x0000fc00 },
+	{ _MMIO(0x2790), 0x0007f87a },
+	{ _MMIO(0x2794), 0x0000fc00 },
+	{ _MMIO(0x2798), 0x0007f8ea },
+	{ _MMIO(0x279c), 0x0000fc00 },
+	{ _MMIO(0x27a0), 0x0007f8e2 },
+	{ _MMIO(0x27a4), 0x0000fc00 },
+	{ _MMIO(0x27a8), 0x0007f8f2 },
+	{ _MMIO(0x27ac), 0x0000fc00 },
+};
+
+static const struct i915_oa_reg mux_config_memory_writes[] = {
+	{ _MMIO(0x253A4), 0x34300000 },
+	{ _MMIO(0x25440), 0x01500000 },
+	{ _MMIO(0x25444), 0x00000120 },
+	{ _MMIO(0x25128), 0x0C200000 },
+	{ _MMIO(0x25380), 0x00000450 },
+	{ _MMIO(0x25390), 0x00052C43 },
+	{ _MMIO(0x25384), 0x00000000 },
+	{ _MMIO(0x25400), 0x00007184 },
+	{ _MMIO(0x25408), 0x0A418820 },
+	{ _MMIO(0x2540C), 0x000820E6 },
+	{ _MMIO(0x25404), 0xFF500000 },
+	{ _MMIO(0x25100), 0x000005D6 },
+	{ _MMIO(0x2510C), 0x1E700000 },
+	{ _MMIO(0x25104), 0x00000000 },
+	{ _MMIO(0x25420), 0x02108421 },
+	{ _MMIO(0x25424), 0x00008421 },
+	{ _MMIO(0x2541C), 0x00000000 },
+	{ _MMIO(0x25428), 0x00000000 },
+};
+
+static int select_memory_writes_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_memory_writes;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_memory_writes);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_memory_writes;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_memory_writes);
+
+	return 0;
+}
+
+static const struct i915_oa_reg b_counter_config_sampler_balance[] = {
+	{ _MMIO(0x2740), 0x00000000 },
+	{ _MMIO(0x2744), 0x00800000 },
+	{ _MMIO(0x2710), 0x00000000 },
+	{ _MMIO(0x2714), 0x00800000 },
+	{ _MMIO(0x2720), 0x00000000 },
+	{ _MMIO(0x2724), 0x00800000 },
+};
+
+static const struct i915_oa_reg mux_config_sampler_balance[] = {
+	{ _MMIO(0x2eb9c), 0x01906400 },
+	{ _MMIO(0x2fb9c), 0x01906400 },
+	{ _MMIO(0x253a4), 0x00000000 },
+	{ _MMIO(0x26b9c), 0x01906400 },
+	{ _MMIO(0x27b9c), 0x01906400 },
+	{ _MMIO(0x27104), 0x00a00000 },
+	{ _MMIO(0x27184), 0x00a50000 },
+	{ _MMIO(0x2e804), 0x00500000 },
+	{ _MMIO(0x2e984), 0x00500000 },
+	{ _MMIO(0x2eb04), 0x00500000 },
+	{ _MMIO(0x2eb80), 0x00000084 },
+	{ _MMIO(0x2eb8c), 0x14200000 },
+	{ _MMIO(0x2eb84), 0x00000000 },
+	{ _MMIO(0x2f804), 0x00050000 },
+	{ _MMIO(0x2f984), 0x00050000 },
+	{ _MMIO(0x2fb04), 0x00050000 },
+	{ _MMIO(0x2fb80), 0x00000084 },
+	{ _MMIO(0x2fb8c), 0x00050800 },
+	{ _MMIO(0x2fb84), 0x00000000 },
+	{ _MMIO(0x25380), 0x00000010 },
+	{ _MMIO(0x2538c), 0x000000c0 },
+	{ _MMIO(0x25384), 0xaa550000 },
+	{ _MMIO(0x25404), 0xffffc000 },
+	{ _MMIO(0x26804), 0x50000000 },
+	{ _MMIO(0x26984), 0x50000000 },
+	{ _MMIO(0x26b04), 0x50000000 },
+	{ _MMIO(0x26b80), 0x00000084 },
+	{ _MMIO(0x26b90), 0x00050800 },
+	{ _MMIO(0x26b84), 0x00000000 },
+	{ _MMIO(0x27804), 0x05000000 },
+	{ _MMIO(0x27984), 0x05000000 },
+	{ _MMIO(0x27b04), 0x05000000 },
+	{ _MMIO(0x27b80), 0x00000084 },
+	{ _MMIO(0x27b90), 0x00000142 },
+	{ _MMIO(0x27b84), 0x00000000 },
+	{ _MMIO(0x26104), 0xa0000000 },
+	{ _MMIO(0x26184), 0xa5000000 },
+	{ _MMIO(0x25424), 0x00008620 },
+	{ _MMIO(0x2541c), 0x00000000 },
+	{ _MMIO(0x25428), 0x0004a54a },
+};
+
+static int select_sampler_balance_config(struct drm_i915_private *dev_priv)
+{
+	dev_priv->perf.oa.mux_regs =
+		mux_config_sampler_balance;
+	dev_priv->perf.oa.mux_regs_len =
+		ARRAY_SIZE(mux_config_sampler_balance);
+
+	dev_priv->perf.oa.b_counter_regs =
+		b_counter_config_sampler_balance;
+	dev_priv->perf.oa.b_counter_regs_len =
+		ARRAY_SIZE(b_counter_config_sampler_balance);
+
+	return 0;
+}
+
 int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv)
 {
 	dev_priv->perf.oa.mux_regs = NULL;
@@ -128,6 +459,16 @@ int i915_oa_select_metric_set_hsw(struct drm_i915_private *dev_priv)
 	switch (dev_priv->perf.oa.metrics_set) {
 	case METRIC_SET_ID_RENDER_BASIC:
 		return select_render_basic_config(dev_priv);
+	case METRIC_SET_ID_COMPUTE_BASIC:
+		return select_compute_basic_config(dev_priv);
+	case METRIC_SET_ID_COMPUTE_EXTENDED:
+		return select_compute_extended_config(dev_priv);
+	case METRIC_SET_ID_MEMORY_READS:
+		return select_memory_reads_config(dev_priv);
+	case METRIC_SET_ID_MEMORY_WRITES:
+		return select_memory_writes_config(dev_priv);
+	case METRIC_SET_ID_SAMPLER_BALANCE:
+		return select_sampler_balance_config(dev_priv);
 	default:
 		return -ENODEV;
 	}
@@ -155,6 +496,116 @@ static struct attribute_group group_render_basic = {
 	.attrs =  attrs_render_basic,
 };
 
+static ssize_t
+show_compute_basic_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_COMPUTE_BASIC);
+}
+
+static struct device_attribute dev_attr_compute_basic_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_compute_basic_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_compute_basic[] = {
+	&dev_attr_compute_basic_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_compute_basic = {
+	.name = "39ad14bc-2380-45c4-91eb-fbcb3aa7ae7b",
+	.attrs =  attrs_compute_basic,
+};
+
+static ssize_t
+show_compute_extended_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_COMPUTE_EXTENDED);
+}
+
+static struct device_attribute dev_attr_compute_extended_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_compute_extended_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_compute_extended[] = {
+	&dev_attr_compute_extended_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_compute_extended = {
+	.name = "3865be28-6982-49fe-9494-e4d1b4795413",
+	.attrs =  attrs_compute_extended,
+};
+
+static ssize_t
+show_memory_reads_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_MEMORY_READS);
+}
+
+static struct device_attribute dev_attr_memory_reads_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_memory_reads_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_memory_reads[] = {
+	&dev_attr_memory_reads_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_memory_reads = {
+	.name = "bb5ed49b-2497-4095-94f6-26ba294db88a",
+	.attrs =  attrs_memory_reads,
+};
+
+static ssize_t
+show_memory_writes_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_MEMORY_WRITES);
+}
+
+static struct device_attribute dev_attr_memory_writes_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_memory_writes_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_memory_writes[] = {
+	&dev_attr_memory_writes_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_memory_writes = {
+	.name = "3358d639-9b5f-45ab-976d-9b08cbfc6240",
+	.attrs =  attrs_memory_writes,
+};
+
+static ssize_t
+show_sampler_balance_id(struct device *kdev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", METRIC_SET_ID_SAMPLER_BALANCE);
+}
+
+static struct device_attribute dev_attr_sampler_balance_id = {
+	.attr = { .name = "id", .mode = S_IRUGO },
+	.show = show_sampler_balance_id,
+	.store = NULL,
+};
+
+static struct attribute *attrs_sampler_balance[] = {
+	&dev_attr_sampler_balance_id.attr,
+	NULL,
+};
+
+static struct attribute_group group_sampler_balance = {
+	.name = "bc274488-b4b6-40c7-90da-b77d7ad16189",
+	.attrs =  attrs_sampler_balance,
+};
+
 int
 i915_perf_init_sysfs_hsw(struct drm_i915_private *dev_priv)
 {
@@ -163,9 +614,34 @@ i915_perf_init_sysfs_hsw(struct drm_i915_private *dev_priv)
 	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_render_basic);
 	if (ret)
 		goto error_render_basic;
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_compute_basic);
+	if (ret)
+		goto error_compute_basic;
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_compute_extended);
+	if (ret)
+		goto error_compute_extended;
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_memory_reads);
+	if (ret)
+		goto error_memory_reads;
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_memory_writes);
+	if (ret)
+		goto error_memory_writes;
+	ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_sampler_balance);
+	if (ret)
+		goto error_sampler_balance;
 
 	return 0;
 
+error_sampler_balance:
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_memory_writes);
+error_memory_writes:
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_memory_reads);
+error_memory_reads:
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_compute_extended);
+error_compute_extended:
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_compute_basic);
+error_compute_basic:
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_render_basic);
 error_render_basic:
 	return ret;
 }
@@ -174,4 +650,9 @@ void
 i915_perf_deinit_sysfs_hsw(struct drm_i915_private *dev_priv)
 {
 	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_render_basic);
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_compute_basic);
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_compute_extended);
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_memory_reads);
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_memory_writes);
+	sysfs_remove_group(dev_priv->perf.metrics_kobj, &group_sampler_balance);
 }
-- 
2.7.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/9] Enable Gen 7 Observation Architecture
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (8 preceding siblings ...)
  2016-04-20 14:23 ` [PATCH 9/9] drm/i915: Add more Haswell OA metric sets Robert Bragg
@ 2016-04-20 14:56 ` Robert Bragg
  2016-04-21  7:46 ` ✓ Fi.CI.BAT: success for Enable Gen 7 Observation Architecture (rev3) Patchwork
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-20 14:56 UTC (permalink / raw)
  To: intel-gfx
  Cc: David Airlie, ML dri-devel, Sourab Gupta, Deepak S, Daniel Vetter


[-- Attachment #1.1: Type: text/plain, Size: 4852 bytes --]

Ah, something I forgot to highlight in the cover letter is an issue with
the "drm/i915: don't whitelist oacontrol in cmd parser" patch in this
series...

I was hoping that I could get away with this without breaking any existing
userspace, since the only userspace I know of that attempts to configure
OACONTROL via LRIs is Mesa. Mesa attempts to verify that it can
successfully write to OACONTROL before exposing its current
INTEL_performance_query implementation, and it seems like it should
gracefully handle a failure here as if the command parser were disabled.

Something curious here is that this patch seemed to work as expected when
my series was recently based on a v4.5 kernel, but since rebasing last week
on a couple of different recent nightlies I'm now seeing gnome-shell
failing to run.

Incidentally, entirely disabling the command parser seems to work; but
removing just OACONTROL seems to cause trouble.

Hopefully I can get a better understanding of what's going on here, though
I can at least confirm the problem is triggered via the
intel_extensions.c:can_write_oacontrol() check in Mesa. Bypassing this
check (with a true or false status) is enough to get gnome-shell running
again.

This could be a fiddly compatibility issue if it turns out not to be
possible to remove OACONTROL from the cmd parser white list. One basic
problem that stems from this is that whenever a GL application starts and
this OACONTROL check is done, it ends up resetting OACONTROL to zero, which
disables the OA unit even if it is in use via the i915 perf interface.

Regards,
- Robert


On Wed, Apr 20, 2016 at 3:23 PM, Robert Bragg <robert@sixbynine.org> wrote:

> I've been working on some i-g-t tests for this new interface and while I still
> have some more tests to write, it still seemed worth sending out another updated
> series in the meantime.
>
>
> Firstly this includes updates based on Emil's previous comments.
>
> Then there have been a few issues hit while writing tests:
>
> * It seems it can take a fairly long time for the MUX config to apply after
>   the register writes have finished, and now the driver inserts a delay after
>   configuration, before enabling periodic sampling.
> * I've found that sometimes the HW tail pointer can get ahead of OA buffer
>   writes, especially with higher sampling frequencies, and so now the driver
>   maintains a margin behind the HW tail pointer to ensure the most recent
>   reports have some time to land before attempting to copy them to userspace.
> * As a sanity check that a report is valid before it's copied to userspace the
>   driver checks the report-id field != 0
> * The _BUFFER_OVERFLOW record has been replaced with a _BUFFER_LOST record with
>   more specific semantics.
> * Since we can't clear the overflow status on Haswell while periodic sampling
>   is enabled, if an overflow occurs we now restart the unit.
> * The maximum OA periodic sampling exponent is now 31
> * We verify the head/tail pointers read back from HW look sane before using as
>   offsets to read reports from the OA buffer. We reset the HW if they look bad.
>
>
> For reference the work-in-progress i-g-t tests can be found here:
>
> https://github.com/rib/intel-gpu-tools
> branch = wip/rib/i915-perf-tests
>
> or browsed here:
> https://github.com/rib/intel-gpu-tools/commits/wip/rib/i915-perf-tests
>
> Also for reference these patches can be fetched from here:
>
> https://github.com/rib/linux
> branch = wip/rib/oa-2016-04-19-nightly
>
> Regards,
> - Robert
>
>
> Robert Bragg (9):
>   drm/i915: Add i915 perf infrastructure
>   drm/i915: rename OACONTROL GEN7_OACONTROL
>   drm/i915: don't whitelist oacontrol in cmd parser
>   drm/i915: Add 'render basic' Haswell OA unit config
>   drm/i915: Enable i915 perf stream for Haswell OA unit
>   drm/i915: advertise available metrics via sysfs
>   drm/i915: Add dev.i915.perf_event_paranoid sysctl option
>   drm/i915: add oa_event_min_timer_exponent sysctl
>   drm/i915: Add more Haswell OA metric sets
>
>  drivers/gpu/drm/i915/Makefile           |    4 +
>  drivers/gpu/drm/i915/i915_cmd_parser.c  |   33 +-
>  drivers/gpu/drm/i915/i915_dma.c         |    8 +
>  drivers/gpu/drm/i915/i915_drv.h         |  155 ++++
>  drivers/gpu/drm/i915/i915_gem_context.c |   24 +-
>  drivers/gpu/drm/i915/i915_oa_hsw.c      |  658 ++++++++++++++
>  drivers/gpu/drm/i915/i915_oa_hsw.h      |   38 +
>  drivers/gpu/drm/i915/i915_perf.c        | 1439 +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_reg.h         |  340 +++++++-
>  include/uapi/drm/i915_drm.h             |  133 +++
>  10 files changed, 2795 insertions(+), 37 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.c
>  create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.h
>  create mode 100644 drivers/gpu/drm/i915/i915_perf.c
>
> --
> 2.7.1
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
@ 2016-04-20 16:16   ` kbuild test robot
  2016-04-20 20:30   ` [Intel-gfx] " kbuild test robot
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: kbuild test robot @ 2016-04-20 16:16 UTC (permalink / raw)
  To: Robert Bragg
  Cc: David Airlie, intel-gfx, dri-devel, Sourab Gupta, kbuild-all,
	Deepak S, Daniel Vetter

[-- Attachment #1: Type: text/plain, Size: 998 bytes --]

Hi,

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on next-20160420]
[cannot apply to v4.6-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robert-Bragg/Enable-Gen-7-Observation-Architecture/20160420-222746
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-s1-201616 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `i915_perf_open_ioctl_locked':
>> (.text+0x2cadf4): undefined reference to `__udivdi3'
   drivers/built-in.o: In function `i915_perf_open_ioctl_locked':
   (.text+0x2cae0d): undefined reference to `__udivdi3'
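
For context: `__udivdi3` is the libgcc helper that GCC emits for a plain 64-bit `/` on 32-bit targets. The kernel does not link against libgcc, so 64-bit divisions there normally go through `div_u64()` from `<linux/math64.h>` (or `do_div()`). The report does not say which expression in `i915_perf_open_ioctl_locked()` is responsible. The userspace sketch below only illustrates the idea of a 64-by-32 division that needs no native 64-bit divide; the name `div_u64_sketch()` is hypothetical and is not the kernel implementation.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for a div_u64()-style helper: divides a 64-bit
 * value by a non-zero 32-bit divisor using shift-and-subtract long
 * division, so no 64-bit hardware divide (or libgcc call) is needed.
 */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	uint64_t quotient = 0;
	uint64_t remainder = 0;

	for (int bit = 63; bit >= 0; bit--) {
		/* Bring down the next bit of the dividend. */
		remainder = (remainder << 1) | ((dividend >> bit) & 1);
		if (remainder >= divisor) {
			remainder -= divisor;
			quotient |= 1ULL << bit;
		}
	}
	return quotient;
}
```

In driver code the fix would more likely be a call to the existing kernel helper rather than a hand-rolled loop like this one.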

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 25411 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
  2016-04-20 16:16   ` kbuild test robot
@ 2016-04-20 20:30   ` kbuild test robot
  2016-04-20 21:11   ` Chris Wilson
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: kbuild test robot @ 2016-04-20 20:30 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, dri-devel, Sourab Gupta, kbuild-all, Deepak S, Daniel Vetter

[-- Attachment #1: Type: text/plain, Size: 810 bytes --]

Hi,

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on next-20160420]
[cannot apply to v4.6-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robert-Bragg/Enable-Gen-7-Observation-Architecture/20160420-222746
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-allmodconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [drivers/gpu/drm/i915/i915.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 54467 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
  2016-04-20 16:16   ` kbuild test robot
  2016-04-20 20:30   ` [Intel-gfx] " kbuild test robot
@ 2016-04-20 21:11   ` Chris Wilson
  2016-04-21 16:15     ` Robert Bragg
  2016-04-20 22:15   ` Chris Wilson
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 21:11 UTC (permalink / raw)
  To: Robert Bragg; +Cc: dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
> +{
> +	assert_spin_locked(&dev_priv->perf.hook_lock);
> +
> +	if (dev_priv->perf.oa.exclusive_stream->enabled) {
> +		unsigned long ctx_id = 0;
> +
> +		if (dev_priv->perf.oa.exclusive_stream->ctx)
> +			ctx_id = dev_priv->perf.oa.specific_ctx_id;
> +
> +		if (dev_priv->perf.oa.exclusive_stream->ctx == NULL || ctx_id) {
> +			bool periodic = dev_priv->perf.oa.periodic;
> +			u32 period_exponent = dev_priv->perf.oa.period_exponent;
> +			u32 report_format = dev_priv->perf.oa.oa_buffer.format;
> +
> +			I915_WRITE(GEN7_OACONTROL,
> +				   (ctx_id & GEN7_OACONTROL_CTX_MASK) |
> +				   (period_exponent <<
> +				    GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
> +				   (periodic ?
> +				    GEN7_OACONTROL_TIMER_ENABLE : 0) |
> +				   (report_format <<
> +				    GEN7_OACONTROL_FORMAT_SHIFT) |
> +				   (ctx_id ?
> +				    GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
> +				   GEN7_OACONTROL_ENABLE);

So this works by only recording when the OACONTROL context address
matches the CCID.

Rather than hooking into switch context and checking every batch whether
you have the exclusive context in case it changed address, you could
just pin the exclusive context when told by the user to bind perf to
that context. Then it will keep the same address until OA is
finished (and releases its vma pin).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (2 preceding siblings ...)
  2016-04-20 21:11   ` Chris Wilson
@ 2016-04-20 22:15   ` Chris Wilson
  2016-04-20 22:46   ` Chris Wilson
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 22:15 UTC (permalink / raw)
  To: Robert Bragg; +Cc: dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_i915_gem_object *bo;
> +	int ret;
> +
> +	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
> +
> +	ret = i915_mutex_lock_interruptible(dev_priv->dev);
> +	if (ret)
> +		return ret;
> +
> +	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE);
> +	if (bo == NULL) {
> +		DRM_ERROR("Failed to allocate OA buffer\n");
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +	dev_priv->perf.oa.oa_buffer.obj = bo;
> +
> +	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
> +	if (ret)
> +		goto err_unref;
> +
> +	/* PreHSW required 512K alignment, HSW requires 16M */
> +	ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0);
> +	if (ret)
> +		goto err_unref;
> +
> +	dev_priv->perf.oa.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
> +	dev_priv->perf.oa.oa_buffer.addr = vmap_oa_buffer(bo);

There is now i915_gem_object_pin_map(bo), which you can use instead of
manually vmapping it, and i915_gem_object_unpin_map() to release.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/9] drm/i915: Add i915 perf infrastructure
  2016-04-20 14:23 ` [PATCH 1/9] drm/i915: Add i915 perf infrastructure Robert Bragg
@ 2016-04-20 22:41   ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 22:41 UTC (permalink / raw)
  To: Robert Bragg
  Cc: dri-devel, David Airlie, intel-gfx, Sourab Gupta, Deepak S,
	Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:06PM +0100, Robert Bragg wrote:
> +static struct intel_context *
> +lookup_context(struct drm_i915_private *dev_priv,
> +	       struct file *user_filp,
> +	       u32 ctx_user_handle)
> +{
> +	struct intel_context *ctx;
> +
> +	mutex_lock(&dev_priv->dev->struct_mutex);
> +	list_for_each_entry(ctx, &dev_priv->context_list, link) {
> +		struct drm_file *drm_file;
> +
> +		if (!ctx->file_priv)
> +			continue;
> +
> +		drm_file = ctx->file_priv->file;
> +
> +		if (user_filp->private_data == drm_file &&
> +		    ctx->user_handle == ctx_user_handle) {
> +			i915_gem_context_reference(ctx);
> +			mutex_unlock(&dev_priv->dev->struct_mutex);
> +
> +			return ctx;
> +		}
> +	}
> +	mutex_unlock(&dev_priv->dev->struct_mutex);
> +
> +	return NULL;
> +}
> +
> +int i915_perf_open_ioctl_locked(struct drm_device *dev,
> +				struct drm_i915_perf_open_param *param,
> +				struct perf_open_properties *props,
> +				struct drm_file *file)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_context *specific_ctx = NULL;
> +	struct i915_perf_stream *stream = NULL;
> +	unsigned long f_flags = 0;
> +	int stream_fd;
> +	int ret = 0;
> +
> +	if (props->single_context) {
> +		u32 ctx_handle = props->ctx_handle;
> +
> +		specific_ctx = lookup_context(dev_priv, file->filp, ctx_handle);

i915_gem_context_get(file->driver_priv, ctx_handle) ?

Though this doesn't allow a ptrace-like ability to watch a context
elsewhere. For that you need to pass in fd:ctx props.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (3 preceding siblings ...)
  2016-04-20 22:15   ` Chris Wilson
@ 2016-04-20 22:46   ` Chris Wilson
  2016-04-22 11:04     ` Robert Bragg
  2016-04-20 22:52   ` Chris Wilson
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 22:46 UTC (permalink / raw)
  To: Robert Bragg; +Cc: dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
> +{
> +	/* Pre-DevBDW: OABUFFER must be set with counters off,
> +	 * before OASTATUS1, but after OASTATUS2
> +	 */
> +	I915_WRITE(GEN7_OASTATUS2, dev_priv->perf.oa.oa_buffer.gtt_offset |
> +		   OA_MEM_SELECT_GGTT); /* head */
> +	I915_WRITE(GEN7_OABUFFER, dev_priv->perf.oa.oa_buffer.gtt_offset);
> +	I915_WRITE(GEN7_OASTATUS1, dev_priv->perf.oa.oa_buffer.gtt_offset |
> +		   OABUFFER_SIZE_16M); /* tail */
> +
> +	/* On Haswell we have to track which OASTATUS1 flags we've
> +	 * already seen since they can't be cleared while periodic
> +	 * sampling is enabled.
> +	 */
> +	dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
> +
> +	/* We have a sanity check in gen7_append_oa_reports() that
> +	 * looks at the report-id field to make sure it's non-zero
> +	 * which relies on the assumption that new reports are
> +	 * being written to zeroed memory...
> +	 */
> +	memset(dev_priv->perf.oa.oa_buffer.addr, 0, SZ_16M);

You allocated zeroed memory.
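
A minimal userspace sketch of the point above, assuming the report ID is the first 32-bit word of each report and using a hypothetical `REPORT_SIZE`: a buffer that comes back zeroed from the allocator already satisfies the "report-id != 0 means written" check without an extra memset. The helper names are illustrative only and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define REPORT_SIZE 256u /* hypothetical report size in bytes */

/* calloc() returns zeroed memory, standing in for a zero-filled GEM object. */
static uint8_t *alloc_zeroed_reports(size_t count)
{
	return calloc(count, REPORT_SIZE);
}

/* Assumption: the first dword of a report is its ID, and 0 means "not yet written". */
static bool report_looks_valid(const uint8_t *report)
{
	uint32_t report_id;

	memcpy(&report_id, report, sizeof(report_id));
	return report_id != 0;
}
```

Note that this only covers the first pass over the buffer; whether slots need clearing again once the buffer wraps or the stream is re-enabled is a separate question that the sketch does not address.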
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (4 preceding siblings ...)
  2016-04-20 22:46   ` Chris Wilson
@ 2016-04-20 22:52   ` Chris Wilson
  2016-04-21 15:43     ` Robert Bragg
  2016-04-20 23:09   ` Chris Wilson
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 22:52 UTC (permalink / raw)
  To: Robert Bragg
  Cc: dri-devel, David Airlie, intel-gfx, Sourab Gupta, Deepak S,
	Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static int i915_oa_read(struct i915_perf_stream *stream,
> +			struct i915_perf_read_state *read_state)
> +{
> +	struct drm_i915_private *dev_priv = stream->dev_priv;
> +
> +	return dev_priv->perf.oa.ops.read(stream, read_state);
> +}

> +	stream->destroy = i915_oa_stream_destroy;
> +	stream->enable = i915_oa_stream_enable;
> +	stream->disable = i915_oa_stream_disable;
> +	stream->can_read = i915_oa_can_read;
> +	stream->wait_unlocked = i915_oa_wait_unlocked;
> +	stream->poll_wait = i915_oa_poll_wait;
> +	stream->read = i915_oa_read;

Why aren't these a const ops table?
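
For illustration, a const ops table could look roughly like the standalone sketch below. The struct and function names (`stream_ops`, `oa_stream_ops`, and so on) are hypothetical and much simpler than the real `i915_perf_stream`; the point is only that the function pointers live in one read-only table and each stream stores a single pointer to it.

```c
#include <stddef.h>

struct stream;

/* All per-stream-type callbacks gathered in one table. */
struct stream_ops {
	void (*enable)(struct stream *s);
	void (*disable)(struct stream *s);
	int (*read)(struct stream *s, char *buf, size_t len);
};

struct stream {
	const struct stream_ops *ops; /* one pointer instead of N function pointers */
	int enabled;
};

static void oa_enable(struct stream *s)
{
	s->enabled = 1;
}

static void oa_disable(struct stream *s)
{
	s->enabled = 0;
}

/* Toy read: pretends to return len bytes while enabled, -1 otherwise. */
static int oa_read(struct stream *s, char *buf, size_t len)
{
	(void)buf;
	return s->enabled ? (int)len : -1;
}

/* const, so the table can be placed in read-only data and shared by all streams. */
static const struct stream_ops oa_stream_ops = {
	.enable = oa_enable,
	.disable = oa_disable,
	.read = oa_read,
};

static void stream_init(struct stream *s)
{
	s->ops = &oa_stream_ops;
	s->enabled = 0;
}
```

Callers then go through `s->ops->read(s, buf, len)` rather than a per-stream function pointer.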
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (5 preceding siblings ...)
  2016-04-20 22:52   ` Chris Wilson
@ 2016-04-20 23:09   ` Chris Wilson
  2016-04-21 15:18     ` Robert Bragg
  2016-04-20 23:16   ` Chris Wilson
  2016-04-23 10:34   ` Martin Peres
  8 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 23:09 UTC (permalink / raw)
  To: Robert Bragg; +Cc: dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static void i915_oa_stream_enable(struct i915_perf_stream *stream)
> +{
> +	struct drm_i915_private *dev_priv = stream->dev_priv;
> +
> +	dev_priv->perf.oa.ops.oa_enable(dev_priv);
> +
> +	if (dev_priv->perf.oa.periodic)
> +		hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
> +			      ns_to_ktime(POLL_PERIOD),
> +			      HRTIMER_MODE_REL_PINNED);
> +}

> +static void i915_oa_stream_disable(struct i915_perf_stream *stream)
> +{
> +	struct drm_i915_private *dev_priv = stream->dev_priv;
> +
> +	dev_priv->perf.oa.ops.oa_disable(dev_priv);
> +
> +	if (dev_priv->perf.oa.periodic)
> +		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
> +}

> +static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
> +{
> +	struct drm_i915_private *dev_priv =
> +		container_of(hrtimer, typeof(*dev_priv),
> +			     perf.oa.poll_check_timer);
> +
> +	if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
> +		wake_up(&dev_priv->perf.oa.poll_wq);
> +
> +	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
> +
> +	return HRTIMER_RESTART;
> +}

> @@ -424,8 +1313,37 @@ void i915_perf_init(struct drm_device *dev)
>  {
>  	struct drm_i915_private *dev_priv = to_i915(dev);
>  
> +	if (!IS_HASWELL(dev))
> +		return;
> +
> +	hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
> +		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	dev_priv->perf.oa.poll_check_timer.function = oa_poll_check_timer_cb;
> +	init_waitqueue_head(&dev_priv->perf.oa.poll_wq);

This timer only serves to wake up pollers / wait_unlocked, right? So why
is it always running (when the stream is enabled)?

What happens to poll / wait_unlocked if oa.periodic is not set? It seems
like those functions would block indefinitely.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (6 preceding siblings ...)
  2016-04-20 23:09   ` Chris Wilson
@ 2016-04-20 23:16   ` Chris Wilson
  2016-04-21 15:01     ` Robert Bragg
  2016-04-23 10:34   ` Martin Peres
  8 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2016-04-20 23:16 UTC (permalink / raw)
  To: Robert Bragg; +Cc: dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> +static int hsw_enable_metric_set(struct drm_i915_private *dev_priv)
> +{
> +	int ret = i915_oa_select_metric_set_hsw(dev_priv);
> +
> +	if (ret)
> +		return ret;
> +
> +	I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
> +
> +	/* PRM:
> +	 *
> +	 * OA unit is using “crclk” for its functionality. When trunk
> +	 * level clock gating takes place, OA clock would be gated,
> +	 * unable to count the events from non-render clock domain.
> +	 * Render clock gating must be disabled when OA is enabled to
> +	 * count the events from non-render domain. Unit level clock
> +	 * gating for RCS should also be disabled.
> +	 */
> +	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
> +				    ~GEN7_DOP_CLOCK_GATE_ENABLE));
> +	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) |
> +				  GEN6_CSUNIT_CLOCK_GATE_DISABLE));
> +
> +	config_oa_regs(dev_priv, dev_priv->perf.oa.mux_regs,
> +		       dev_priv->perf.oa.mux_regs_len);
> +
> +	/* It takes a fairly long time for a new MUX configuration to
> +	 * be applied after these register writes. This delay
> +	 * duration was derived empirically based on the render_basic
> +	 * config but hopefully it covers the maximum configuration
> +	 * latency...
> +	 */
> +	mdelay(100);

You really want to busy spin for 100ms? msleep() perhaps!

Did you look for some register you can observe the change in when the
mux is reconfigured? Is even reading one of the OA registers enough?

> +	config_oa_regs(dev_priv, dev_priv->perf.oa.b_counter_regs,
> +		       dev_priv->perf.oa.b_counter_regs_len);
> +
> +	return 0;
> +}
> +
> +static void hsw_disable_metric_set(struct drm_i915_private *dev_priv)
> +{
> +	I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) &
> +				  ~GEN6_CSUNIT_CLOCK_GATE_DISABLE));
> +	I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) |
> +				    GEN7_DOP_CLOCK_GATE_ENABLE));
> +
> +	I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
> +				      ~GT_NOA_ENABLE));

You didn't preserve any other chicken bits during enable_metric_set.
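
To make the asymmetry concrete, here is a small sketch with a plain variable standing in for the MMIO register. `fake_reg`, `reg_read()`, `reg_write()` and the bit position of `NOA_ENABLE_BIT` are all assumptions for illustration, not the real `I915_READ`/`I915_WRITE` macros or register layout. Both the enable and the disable path use read-modify-write, so unrelated bits survive.

```c
#include <stdint.h>

#define NOA_ENABLE_BIT (1u << 0) /* hypothetical bit position */

static uint32_t fake_reg; /* stands in for the hardware register */

static uint32_t reg_read(void)
{
	return fake_reg;
}

static void reg_write(uint32_t value)
{
	fake_reg = value;
}

/* Set only our bit; every other bit keeps its current value. */
static void noa_enable(void)
{
	reg_write(reg_read() | NOA_ENABLE_BIT);
}

/* Clear only our bit, mirroring the enable path. */
static void noa_disable(void)
{
	reg_write(reg_read() & ~NOA_ENABLE_BIT);
}
```

A plain `reg_write(NOA_ENABLE_BIT)` in the enable path would instead zero every other bit in the register, which is the concern raised above.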
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* ✓ Fi.CI.BAT: success for Enable Gen 7 Observation Architecture (rev3)
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (9 preceding siblings ...)
  2016-04-20 14:56 ` [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
@ 2016-04-21  7:46 ` Patchwork
  2016-04-21 12:41 ` ✗ Fi.CI.BAT: failure " Patchwork
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2016-04-21  7:46 UTC (permalink / raw)
  To: Robert Bragg; +Cc: intel-gfx

== Series Details ==

Series: Enable Gen 7 Observation Architecture (rev3)
URL   : https://patchwork.freedesktop.org/series/3024/
State : success

== Summary ==

Series 3024v3 Enable Gen 7 Observation Architecture
http://patchwork.freedesktop.org/api/1.0/series/3024/revisions/3/mbox/


bdw-nuci7        total:113  pass:81   dwarn:0   dfail:0   fail:0   skip:5  
bdw-ultra        total:194  pass:170  dwarn:0   dfail:0   fail:1   skip:23 
bsw-nuc-2        total:193  pass:154  dwarn:0   dfail:0   fail:0   skip:39 
byt-nuc          total:193  pass:155  dwarn:0   dfail:0   fail:0   skip:38 
hsw-brixbox      total:137  pass:116  dwarn:0   dfail:0   fail:0   skip:20 
ilk-hp8440p      total:194  pass:137  dwarn:0   dfail:0   fail:0   skip:57 
ivb-t430s        total:194  pass:166  dwarn:0   dfail:0   fail:0   skip:28 
skl-i7k-2        total:194  pass:168  dwarn:0   dfail:0   fail:1   skip:25 
skl-nuci5        total:194  pass:183  dwarn:0   dfail:0   fail:0   skip:11 

Results at /archive/results/CI_IGT_test/Patchwork_1962/

eb848ab2b19d25a08ca3b2b5e4b2f74c7f7c962c drm-intel-nightly: 2016y-04m-20d-18h-48m-11s UTC integration manifest
c6d8801 drm/i915: Add more Haswell OA metric sets
d369895 drm/i915: add oa_event_min_timer_exponent sysctl
3f46613 drm/i915: Add dev.i915.perf_event_paranoid sysctl option
99927cb drm/i915: advertise available metrics via sysfs
b28c24f drm/i915: Enable i915 perf stream for Haswell OA unit
51e55b1 drm/i915: Add 'render basic' Haswell OA unit config
2d80151 drm/i915: don't whitelist oacontrol in cmd parser
4af4699 drm/i915: rename OACONTROL GEN7_OACONTROL
a92910d drm/i915: Add i915 perf infrastructure

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* ✗ Fi.CI.BAT: failure for Enable Gen 7 Observation Architecture (rev3)
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (10 preceding siblings ...)
  2016-04-21  7:46 ` ✓ Fi.CI.BAT: success for Enable Gen 7 Observation Architecture (rev3) Patchwork
@ 2016-04-21 12:41 ` Patchwork
  2016-04-23  8:31 ` ✗ Fi.CI.BAT: warning " Patchwork
  2016-04-24 17:23 ` ✓ Fi.CI.BAT: success " Patchwork
  13 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2016-04-21 12:41 UTC (permalink / raw)
  To: Robert Bragg; +Cc: intel-gfx

== Series Details ==

Series: Enable Gen 7 Observation Architecture (rev3)
URL   : https://patchwork.freedesktop.org/series/3024/
State : failure

== Summary ==

Series 3024v3 Enable Gen 7 Observation Architecture
http://patchwork.freedesktop.org/api/1.0/series/3024/revisions/3/mbox/

Test drv_getparams_basic:
        Subgroup basic-eu-total:
                pass       -> INCOMPLETE (bdw-nuci7)
Test drv_module_reload_basic:
                pass       -> INCOMPLETE (hsw-brixbox)
Test gem_busy:
        Subgroup basic-bsd1:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup basic-bsd2:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup basic-vebox:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_ctx_param:
        Subgroup basic-default:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_ctx_switch:
        Subgroup basic-default:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_exec_basic:
        Subgroup readonly-blt:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup readonly-bsd:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup readonly-bsd2:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup readonly-default:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_flink_basic:
        Subgroup double-flink:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_mmap:
        Subgroup basic:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_mmap_gtt:
        Subgroup basic-small-bo:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup basic-write-read:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_ringfill:
        Subgroup basic-default:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_storedw_loop:
        Subgroup basic-bsd1:
                pass       -> INCOMPLETE (bdw-nuci7)
Test gem_sync:
        Subgroup basic-bsd:
                pass       -> INCOMPLETE (bdw-nuci7)
Test kms_addfb_basic:
        Subgroup addfb25-bad-modifier:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup bad-pitch-63:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup no-handle:
                pass       -> INCOMPLETE (bdw-nuci7)
Test kms_flip:
        Subgroup basic-flip-vs-dpms:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup basic-flip-vs-wf_vblank:
                fail       -> PASS       (bsw-nuc-2)
        Subgroup basic-plain-flip:
                pass       -> INCOMPLETE (bdw-nuci7)
Test kms_pipe_crc_basic:
        Subgroup bad-source:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup nonblocking-crc-pipe-b:
                pass       -> TIMEOUT    (bdw-nuci7)
Test pm_rpm:
        Subgroup basic-rte:
                pass       -> INCOMPLETE (bdw-nuci7)
Test prime_self_import:
        Subgroup basic-llseek-bad:
                pass       -> INCOMPLETE (bdw-nuci7)
        Subgroup basic-with_one_bo:
                pass       -> INCOMPLETE (bdw-nuci7)

bdw-nuci7        total:113  pass:81   dwarn:0   dfail:0   fail:0   skip:5  
bdw-ultra        total:194  pass:170  dwarn:0   dfail:0   fail:1   skip:23 
bsw-nuc-2        total:193  pass:154  dwarn:0   dfail:0   fail:0   skip:39 
byt-nuc          total:193  pass:155  dwarn:0   dfail:0   fail:0   skip:38 
hsw-brixbox      total:137  pass:116  dwarn:0   dfail:0   fail:0   skip:20 
ilk-hp8440p      total:194  pass:137  dwarn:0   dfail:0   fail:0   skip:57 
ivb-t430s        total:194  pass:166  dwarn:0   dfail:0   fail:0   skip:28 
skl-i7k-2        total:194  pass:168  dwarn:0   dfail:0   fail:1   skip:25 
skl-nuci5        total:194  pass:183  dwarn:0   dfail:0   fail:0   skip:11 

Results at /archive/results/CI_IGT_test/Patchwork_1962/

eb848ab2b19d25a08ca3b2b5e4b2f74c7f7c962c drm-intel-nightly: 2016y-04m-20d-18h-48m-11s UTC integration manifest
c6d8801 drm/i915: Add more Haswell OA metric sets
d369895 drm/i915: add oa_event_min_timer_exponent sysctl
3f46613 drm/i915: Add dev.i915.perf_event_paranoid sysctl option
99927cb drm/i915: advertise available metrics via sysfs
b28c24f drm/i915: Enable i915 perf stream for Haswell OA unit
51e55b1 drm/i915: Add 'render basic' Haswell OA unit config
2d80151 drm/i915: don't whitelist oacontrol in cmd parser
4af4699 drm/i915: rename OACONTROL GEN7_OACONTROL
a92910d drm/i915: Add i915 perf infrastructure

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 23:16   ` Chris Wilson
@ 2016-04-21 15:01     ` Robert Bragg
  0 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-21 15:01 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Thu, Apr 21, 2016 at 12:16 AM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> > +static int hsw_enable_metric_set(struct drm_i915_private *dev_priv)
> > +{
> > +     int ret = i915_oa_select_metric_set_hsw(dev_priv);
> > +
> > +     if (ret)
> > +             return ret;
> > +
> > +     I915_WRITE(GDT_CHICKEN_BITS, GT_NOA_ENABLE);
> > +
> > +     /* PRM:
> > +      *
> > +      * OA unit is using “crclk” for its functionality. When trunk
> > +      * level clock gating takes place, OA clock would be gated,
> > +      * unable to count the events from non-render clock domain.
> > +      * Render clock gating must be disabled when OA is enabled to
> > +      * count the events from non-render domain. Unit level clock
> > +      * gating for RCS should also be disabled.
> > +      */
> > +     I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
> > +                                 ~GEN7_DOP_CLOCK_GATE_ENABLE));
> > +     I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) |
> > +                               GEN6_CSUNIT_CLOCK_GATE_DISABLE));
> > +
> > +     config_oa_regs(dev_priv, dev_priv->perf.oa.mux_regs,
> > +                    dev_priv->perf.oa.mux_regs_len);
> > +
> > +     /* It takes a fairly long time for a new MUX configuration to
> > +      * be applied after these register writes. This delay
> > +      * duration was derived empirically based on the render_basic
> > +      * config but hopefully it covers the maximum configuration
> > +      * latency...
> > +      */
> > +     mdelay(100);
>
> You really want to busy spin for 100ms? msleep() perhaps!
>

Ah, oops, I forgot to change this, thanks!


>
> Did you look for some register you can observe the change in when the
> mux is reconfigured? Is even reading one of the OA registers enough?
>

Although I can't really comprehend why the delay apparently needs to be
quite so long, based on my limited understanding of some of the NOA
microarchitecture involved here it makes some sense to me that there would
be a delay, and that it would vary somewhat depending on the particular MUX
config. Unfortunately I don't know of a trick for getting explicit feedback
of completion.

I did bring this up briefly, recently in discussion with others more
familiar with the HW side of things, but haven't had much feedback on this
so far. afaik other OS drivers aren't currently accounting for a need to
have a delay here.

For reference, 100ms was picked as I was experimenting with stepping up the
delay by orders of magnitude and found 10ms wasn't enough. Potentially I
could experiment further with delays between 10 and 100ms, but I suppose it
won't make a big difference.
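
As a rough illustration of how the range between 10 and 100ms could be
narrowed, here is a minimal userspace sketch of a bisection over candidate
delays. Everything in it is hypothetical: fake_probe_fn stands in for some
test that checks whether counters look valid after a given delay, and the
37ms threshold is a made-up number for demonstration only. It also assumes
that any delay longer than a working delay works too, which may not hold if
the latency really varies per MUX config.

```c
#include <stdbool.h>

/* Hypothetical probe: returns true if the configuration is observed to
 * have taken effect after waiting delay_ms. */
typedef bool (*fake_probe_fn)(unsigned int delay_ms);

/* Bisect between a delay known to fail (lo_ms) and one known to work
 * (hi_ms), assuming longer delays never behave worse than shorter ones. */
static unsigned int fake_min_working_delay(unsigned int lo_ms,
					   unsigned int hi_ms,
					   fake_probe_fn probe)
{
	while (hi_ms - lo_ms > 1) {
		unsigned int mid = lo_ms + (hi_ms - lo_ms) / 2;

		if (probe(mid))
			hi_ms = mid;	/* mid works: tighten the upper bound */
		else
			lo_ms = mid;	/* mid fails: raise the lower bound */
	}
	return hi_ms;
}

/* Stand-in probe for demonstration: pretends 37ms is the true threshold. */
static bool fake_probe_37(unsigned int delay_ms)
{
	return delay_ms >= 37;
}
```

With a 10ms failure and a 100ms success as the starting bounds this needs
about seven probes, so it would be cheap to repeat per metric set.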



>
> > +     config_oa_regs(dev_priv, dev_priv->perf.oa.b_counter_regs,
> > +                    dev_priv->perf.oa.b_counter_regs_len);
> > +
> > +     return 0;
> > +}
> > +
> > +static void hsw_disable_metric_set(struct drm_i915_private *dev_priv)
> > +{
> > +     I915_WRITE(GEN6_UCGCTL1, (I915_READ(GEN6_UCGCTL1) &
> > +                               ~GEN6_CSUNIT_CLOCK_GATE_DISABLE));
> > +     I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) |
> > +                                 GEN7_DOP_CLOCK_GATE_ENABLE));
> > +
> > +     I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
> > +                                   ~GT_NOA_ENABLE));
>
> You didn't preserve any other chicken bits during enable_metric_set.
>

Hmm, good point. I think I'll aim to preserve other bits when setting if
that works, just in case something else needs to fiddle with the same
register later.
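
For illustration only, a minimal userspace sketch of the read-modify-write
pattern being discussed. The register is simulated with a plain variable,
and fake_read(), fake_write() and FAKE_NOA_ENABLE are hypothetical stand-ins
for I915_READ(), I915_WRITE() and GT_NOA_ENABLE; the bit position is not the
real one.

```c
#include <stdint.h>

/* Hypothetical stand-in for an MMIO register; the driver itself goes
 * through I915_READ()/I915_WRITE() on GDT_CHICKEN_BITS. */
static uint32_t fake_chicken_reg;

#define FAKE_NOA_ENABLE (1u << 0)	/* illustrative bit position only */

static uint32_t fake_read(void)
{
	return fake_chicken_reg;
}

static void fake_write(uint32_t val)
{
	fake_chicken_reg = val;
}

/* Set the enable bit while preserving whatever else is in the register. */
static void fake_noa_enable(void)
{
	fake_write(fake_read() | FAKE_NOA_ENABLE);
}

/* Clear only the enable bit, mirroring the disable path in the patch. */
static void fake_noa_disable(void)
{
	fake_write(fake_read() & ~FAKE_NOA_ENABLE);
}
```

The point is only that enable and disable become symmetric, so any other
bits set by someone else survive a full enable/disable cycle.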


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 23:09   ` Chris Wilson
@ 2016-04-21 15:18     ` Robert Bragg
  2016-04-22  1:10       ` Robert Bragg
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-04-21 15:18 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Thu, Apr 21, 2016 at 12:09 AM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> > +static void i915_oa_stream_enable(struct i915_perf_stream *stream)
> > +{
> > +     struct drm_i915_private *dev_priv = stream->dev_priv;
> > +
> > +     dev_priv->perf.oa.ops.oa_enable(dev_priv);
> > +
> > +     if (dev_priv->perf.oa.periodic)
> > +             hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
> > +                           ns_to_ktime(POLL_PERIOD),
> > +                           HRTIMER_MODE_REL_PINNED);
> > +}
>
> > +static void i915_oa_stream_disable(struct i915_perf_stream *stream)
> > +{
> > +     struct drm_i915_private *dev_priv = stream->dev_priv;
> > +
> > +     dev_priv->perf.oa.ops.oa_disable(dev_priv);
> > +
> > +     if (dev_priv->perf.oa.periodic)
> > +             hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
> > +}
>
> > +static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer
> *hrtimer)
> > +{
> > +     struct drm_i915_private *dev_priv =
> > +             container_of(hrtimer, typeof(*dev_priv),
> > +                          perf.oa.poll_check_timer);
> > +
> > +     if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
> > +             wake_up(&dev_priv->perf.oa.poll_wq);
> > +
> > +     hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
> > +
> > +     return HRTIMER_RESTART;
> > +}
>
> > @@ -424,8 +1313,37 @@ void i915_perf_init(struct drm_device *dev)
> >  {
> >       struct drm_i915_private *dev_priv = to_i915(dev);
> >
> > +     if (!IS_HASWELL(dev))
> > +             return;
> > +
> > +     hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
> > +                  CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +     dev_priv->perf.oa.poll_check_timer.function =
> oa_poll_check_timer_cb;
> > +     init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
>
> This timer only serves to wake up pollers / wait_unlocked, right? So why
> is it always running (when the stream is enabled)?
>
> What happens to poll / wait_unlocked if oa.periodic is not set? It seems
> like those functions would block indefinitely.
>

Right, it's unnecessary. I'll look at limiting it to just while polling or
for blocking reads.

Good point about the blocking case too.

I just started testing that scenario yesterday, writing an MI_RPC unit
test which opens a stream without requesting periodic sampling, but it
doesn't poll or read in that case so far, so I didn't hit this yet.

At least for the read() this is partially considered by returning -EIO if
attempting a blocking read while the stream is disabled, but it doesn't
consider the case that the stream is enabled but periodic sampling isn't
enabled.
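
To make the two cases concrete, here is a small userspace sketch of one
possible policy, where a blocking wait is refused whenever nothing would
ever wake the waiter. The struct and function names are hypothetical and
much simpler than the driver's real state, and returning -EIO for the
non-periodic case is just an assumption about how it might be handled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the stream state discussed above. */
struct fake_oa_stream {
	bool enabled;	/* stream has been enabled */
	bool periodic;	/* periodic sampling requested, so the timer runs */
};

/* Return 0 if a blocking wait can eventually be woken, or -EIO if no
 * report could ever arrive to wake the waiter. */
static int fake_can_block(const struct fake_oa_stream *s)
{
	if (!s->enabled)
		return -EIO;	/* the case read() already considers */
	if (!s->periodic)
		return -EIO;	/* the additional case raised in review */
	return 0;
}
```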

Regards,
- Robert


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 22:52   ` Chris Wilson
@ 2016-04-21 15:43     ` Robert Bragg
  2016-04-21 16:21       ` Chris Wilson
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-04-21 15:43 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Wed, Apr 20, 2016 at 11:52 PM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> > +static int i915_oa_read(struct i915_perf_stream *stream,
> > +                     struct i915_perf_read_state *read_state)
> > +{
> > +     struct drm_i915_private *dev_priv = stream->dev_priv;
> > +
> > +     return dev_priv->perf.oa.ops.read(stream, read_state);
> > +}
>
> > +     stream->destroy = i915_oa_stream_destroy;
> > +     stream->enable = i915_oa_stream_enable;
> > +     stream->disable = i915_oa_stream_disable;
> > +     stream->can_read = i915_oa_can_read;
> > +     stream->wait_unlocked = i915_oa_wait_unlocked;
> > +     stream->poll_wait = i915_oa_poll_wait;
> > +     stream->read = i915_oa_read;
>
> Why aren't these a const ops table?
>

No particular reason; I guess it just seemed straightforward enough at the
time. I suppose it avoids some redundant pointer indirection and could suit
defining streams in the future that might find it awkward to have static
ops (don't have anything like that in mind though) but it's at the expense
of a slightly larger stream struct (though also don't see that as a concern
currently).

Can change if you like.

Regards,
- Robert


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 21:11   ` Chris Wilson
@ 2016-04-21 16:15     ` Robert Bragg
  2016-04-23  8:48       ` Chris Wilson
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-04-21 16:15 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Wed, Apr 20, 2016 at 10:11 PM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> > +static void gen7_update_oacontrol_locked(struct drm_i915_private
> *dev_priv)
> > +{
> > +     assert_spin_locked(&dev_priv->perf.hook_lock);
> > +
> > +     if (dev_priv->perf.oa.exclusive_stream->enabled) {
> > +             unsigned long ctx_id = 0;
> > +
> > +             if (dev_priv->perf.oa.exclusive_stream->ctx)
> > +                     ctx_id = dev_priv->perf.oa.specific_ctx_id;
> > +
> > +             if (dev_priv->perf.oa.exclusive_stream->ctx == NULL ||
> ctx_id) {
> > +                     bool periodic = dev_priv->perf.oa.periodic;
> > +                     u32 period_exponent =
> dev_priv->perf.oa.period_exponent;
> > +                     u32 report_format =
> dev_priv->perf.oa.oa_buffer.format;
> > +
> > +                     I915_WRITE(GEN7_OACONTROL,
> > +                                (ctx_id & GEN7_OACONTROL_CTX_MASK) |
> > +                                (period_exponent <<
> > +                                 GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
> > +                                (periodic ?
> > +                                 GEN7_OACONTROL_TIMER_ENABLE : 0) |
> > +                                (report_format <<
> > +                                 GEN7_OACONTROL_FORMAT_SHIFT) |
> > +                                (ctx_id ?
> > +                                 GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
> > +                                GEN7_OACONTROL_ENABLE);
>
> So this works by only recording when the OACONTROL context address
> matches the CCID.
>

> Rather than hooking into switch context and checking every batch whether
> you have the exclusive context in case it changed address, you could
> just pin the exclusive context when told by the user to bind perf to
> that context. Then it will also have the same address until oa is
> finished (and releases its vma pin).
>

Yeah, this was the approach I first went with when the driver was perf
based, though we ended up deciding to go with hooking into pinning and
updating the OA state in the end.

E.g. for reference:
https://lists.freedesktop.org/archives/intel-gfx/2014-November/055385.html
(wow, sad face after seeing how long I've been kicking this stuff)

I'd prefer to stick with this approach now, unless you see a big problem
with it.

Regards,
- Robert



> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-21 15:43     ` Robert Bragg
@ 2016-04-21 16:21       ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2016-04-21 16:21 UTC (permalink / raw)
  To: Robert Bragg
  Cc: ML dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Thu, Apr 21, 2016 at 04:43:19PM +0100, Robert Bragg wrote:
>    On Wed, Apr 20, 2016 at 11:52 PM, Chris Wilson
>    <[1]chris@chris-wilson.co.uk> wrote:
> 
>      On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
>      > +static int i915_oa_read(struct i915_perf_stream *stream,
>      > +                     struct i915_perf_read_state *read_state)
>      > +{
>      > +     struct drm_i915_private *dev_priv = stream->dev_priv;
>      > +
>      > +     return dev_priv->perf.oa.ops.read(stream, read_state);
>      > +}
> 
>      > +     stream->destroy = i915_oa_stream_destroy;
>      > +     stream->enable = i915_oa_stream_enable;
>      > +     stream->disable = i915_oa_stream_disable;
>      > +     stream->can_read = i915_oa_can_read;
>      > +     stream->wait_unlocked = i915_oa_wait_unlocked;
>      > +     stream->poll_wait = i915_oa_poll_wait;
>      > +     stream->read = i915_oa_read;
> 
>      Why aren't these a const ops table?
> 
>    No particular reason; I guess it just seemed straightforward enough at the
>    time. I suppose it avoids some redundant pointer indirection and could
>    suit defining streams in the future that might find it awkward to have
>    static ops (don't have anything like that in mind though) but it's at the
>    expense of a slightly larger stream struct (though also don't see that as
>    a concern currently).
> 
>    Can change if you like.

I think it is safe to say it is considered best practice to have vfunc
tables in read-only memory. Certainly raises an eyebrow when they look
like they could be modified on the fly.
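
A minimal userspace sketch of the const ops table pattern, for reference.
The fake_* names and the simplified signatures are hypothetical and only
loosely follow the callbacks assigned in the quoted patch.

```c
struct fake_stream;

/* Hypothetical ops table with simplified callback signatures. */
struct fake_stream_ops {
	void (*enable)(struct fake_stream *stream);
	void (*disable)(struct fake_stream *stream);
	int (*read)(struct fake_stream *stream);
};

struct fake_stream {
	const struct fake_stream_ops *ops;	/* one pointer per stream */
	int enabled;
};

static void fake_oa_enable(struct fake_stream *s)
{
	s->enabled = 1;
}

static void fake_oa_disable(struct fake_stream *s)
{
	s->enabled = 0;
}

static int fake_oa_read(struct fake_stream *s)
{
	return s->enabled ? 0 : -1;
}

/* Declared const so the table can live in read-only data and cannot be
 * reassigned at run time. */
static const struct fake_stream_ops fake_oa_stream_ops = {
	.enable = fake_oa_enable,
	.disable = fake_oa_disable,
	.read = fake_oa_read,
};

static void fake_stream_init(struct fake_stream *s)
{
	s->ops = &fake_oa_stream_ops;
	s->enabled = 0;
}
```

Callers then go through stream->ops->read(stream) and so on, at the cost
of one extra pointer dereference per call.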
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-21 15:18     ` Robert Bragg
@ 2016-04-22  1:10       ` Robert Bragg
  0 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-04-22  1:10 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Thu, Apr 21, 2016 at 4:18 PM, Robert Bragg <robert@sixbynine.org> wrote:

>
>
> On Thu, Apr 21, 2016 at 12:09 AM, Chris Wilson <chris@chris-wilson.co.uk>
> wrote:
>
>> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
>> > +static void i915_oa_stream_enable(struct i915_perf_stream *stream)
>> > +{
>> > +     struct drm_i915_private *dev_priv = stream->dev_priv;
>> > +
>> > +     dev_priv->perf.oa.ops.oa_enable(dev_priv);
>> > +
>> > +     if (dev_priv->perf.oa.periodic)
>> > +             hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
>> > +                           ns_to_ktime(POLL_PERIOD),
>> > +                           HRTIMER_MODE_REL_PINNED);
>> > +}
>>
>> > +static void i915_oa_stream_disable(struct i915_perf_stream *stream)
>> > +{
>> > +     struct drm_i915_private *dev_priv = stream->dev_priv;
>> > +
>> > +     dev_priv->perf.oa.ops.oa_disable(dev_priv);
>> > +
>> > +     if (dev_priv->perf.oa.periodic)
>> > +             hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
>> > +}
>>
>> > +static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer
>> *hrtimer)
>> > +{
>> > +     struct drm_i915_private *dev_priv =
>> > +             container_of(hrtimer, typeof(*dev_priv),
>> > +                          perf.oa.poll_check_timer);
>> > +
>> > +     if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
>> > +             wake_up(&dev_priv->perf.oa.poll_wq);
>> > +
>> > +     hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
>> > +
>> > +     return HRTIMER_RESTART;
>> > +}
>>
>> > @@ -424,8 +1313,37 @@ void i915_perf_init(struct drm_device *dev)
>> >  {
>> >       struct drm_i915_private *dev_priv = to_i915(dev);
>> >
>> > +     if (!IS_HASWELL(dev))
>> > +             return;
>> > +
>> > +     hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
>> > +                  CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> > +     dev_priv->perf.oa.poll_check_timer.function =
>> oa_poll_check_timer_cb;
>> > +     init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
>>
>> This timer only serves to wake up pollers / wait_unlocked, right? So why
>> is it always running (when the stream is enabled)?
>>
>>
> Right, it's unnecessary. I'll look at limiting it to just while polling or
> for blocking reads.
>

Actually, looking at this, I couldn't see a clean way to synchronize with
do_sys_poll() returning to be able to cancel the hrtimer. The .poll fop is
only responsible for registering a wait queue via poll_wait() and checking
if there are already events pending before any wait.

The current hrtimer frequency was picked as a reasonable default to ensure
we pick up samples before an overflow with high frequency OA sampling while
also avoiding too much latency in picking up samples written at a lower
frequency. Something I've considered a few times before is that we might
want to add a property that can influence the latency userspace is happy
with, which might also alleviate some of this concern that the hrtimer runs
all the time the stream is enabled when it could often be fine to run at a
much lower frequency than the current 200Hz.

For now, maybe it's ok to stick with the fixed frequency, and I'll plan to
experiment with a property for influencing the maximum latency?
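
As a rough sketch of what such a property might feed into, the helper below
maps a userspace latency hint to a poll period. The function name, the
meaning of a zero hint and the 100us floor are all assumptions made for
illustration; only the 200Hz default comes from the discussion above. A real
version would presumably also cap the period so the OA buffer can't overflow
between polls.

```c
#include <stdint.h>

#define FAKE_NSEC_PER_SEC	1000000000ull
#define FAKE_DEFAULT_POLL_HZ	200ull		/* the fixed rate mentioned above */
#define FAKE_MIN_PERIOD_NS	100000ull	/* illustrative floor, not from the patch */

/* Hypothetical helper: choose a poll period from a userspace latency hint.
 * A hint of 0 means "no preference" and keeps the 200Hz default (5ms). */
static uint64_t fake_poll_period_ns(uint64_t max_latency_ns)
{
	if (max_latency_ns == 0)
		return FAKE_NSEC_PER_SEC / FAKE_DEFAULT_POLL_HZ;

	/* Clamp tiny hints so they cannot create a timer storm. */
	if (max_latency_ns < FAKE_MIN_PERIOD_NS)
		return FAKE_MIN_PERIOD_NS;

	return max_latency_ns;
}
```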

Regards,
- Robert


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 22:46   ` Chris Wilson
@ 2016-04-22 11:04     ` Robert Bragg
  2016-04-22 11:18       ` Chris Wilson
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-04-22 11:04 UTC (permalink / raw)
  To: Chris Wilson, Robert Bragg, intel-gfx, Daniel Vetter,
	Jani Nikula, David Airlie, Zhenyu Wang, Sourab Gupta, Deepak S,
	ML dri-devel


On Wed, Apr 20, 2016 at 11:46 PM, Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
> > +static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
> > +{
> > +     /* Pre-DevBDW: OABUFFER must be set with counters off,
> > +      * before OASTATUS1, but after OASTATUS2
> > +      */
> > +     I915_WRITE(GEN7_OASTATUS2, dev_priv->perf.oa.oa_buffer.gtt_offset |
> > +                OA_MEM_SELECT_GGTT); /* head */
> > +     I915_WRITE(GEN7_OABUFFER, dev_priv->perf.oa.oa_buffer.gtt_offset);
> > +     I915_WRITE(GEN7_OASTATUS1, dev_priv->perf.oa.oa_buffer.gtt_offset |
> > +                OABUFFER_SIZE_16M); /* tail */
> > +
> > +     /* On Haswell we have to track which OASTATUS1 flags we've
> > +      * already seen since they can't be cleared while periodic
> > +      * sampling is enabled.
> > +      */
> > +     dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
> > +
> > +     /* We have a sanity check in gen7_append_oa_reports() that
> > +      * looks at the report-id field to make sure it's non-zero
> > +      * which relies on the assumption that new reports are
> > +      * being written to zeroed memory...
> > +      */
> > +     memset(dev_priv->perf.oa.oa_buffer.addr, 0, SZ_16M);
>
> You allocated zeroed memory.
>

yup. currently I have this memset here because we may re-init the buffer if
the stream is disabled then re-enabled (via I915_PERF_IOCTL_ENABLE) or if
we have to reset the unit on error. In these cases there may be some number
of reports in the buffer with non-zero report-id fields while we still want
to be sure new reports are being written to zeroed memory so that the
sanity check that report-id != 0 will continue to be valid.

I've had it in mind to consider optimizing this at some point to minimize
how much of the buffer is cleared, maybe just for the _DISABLE/_ENABLE case
where I'd expect the buffer will mostly be empty before disabling the
stream.
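
To show why the re-init matters for the report-id check, here is a small
userspace sketch with a simulated buffer. The report layout, the buffer
size and the fake_* helpers are hypothetical, and stopping at the first
zero report-id is an assumption about how the check might be applied rather
than a description of gen7_append_oa_reports().

```c
#include <stdint.h>
#include <string.h>

#define FAKE_NUM_REPORTS 8

/* Hypothetical report layout: only the leading report-id matters here. */
struct fake_report {
	uint32_t report_id;
	uint32_t payload;
};

static struct fake_report fake_buf[FAKE_NUM_REPORTS];

/* Re-initialise the buffer for reuse; stale non-zero report ids would
 * otherwise defeat the validity check below. */
static void fake_init_buffer(void)
{
	memset(fake_buf, 0, sizeof(fake_buf));
}

/* Count reports in [head, tail) that look valid (report_id != 0),
 * stopping at the first one that has not landed yet. Both indices
 * must be below FAKE_NUM_REPORTS. */
static unsigned int fake_count_valid(unsigned int head, unsigned int tail)
{
	unsigned int n = 0;
	unsigned int i;

	for (i = head; i != tail; i = (i + 1) % FAKE_NUM_REPORTS) {
		if (fake_buf[i].report_id == 0)
			break;
		n++;
	}
	return n;
}
```

Without the memset on re-init, reports left over from before a disable or
reset would still pass the check even though the HW hasn't rewritten them.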

- Robert


> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-22 11:04     ` Robert Bragg
@ 2016-04-22 11:18       ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2016-04-22 11:18 UTC (permalink / raw)
  To: Robert Bragg
  Cc: ML dri-devel, David Airlie, intel-gfx, Sourab Gupta, Deepak S,
	Daniel Vetter

On Fri, Apr 22, 2016 at 12:04:26PM +0100, Robert Bragg wrote:
>    On Wed, Apr 20, 2016 at 11:46 PM, Chris Wilson
>    <[1]chris@chris-wilson.co.uk> wrote:
> 
>      On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
>      > +static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
>      > +{
>      > +     /* Pre-DevBDW: OABUFFER must be set with counters off,
>      > +      * before OASTATUS1, but after OASTATUS2
>      > +      */
>      > +     I915_WRITE(GEN7_OASTATUS2,
>      dev_priv->perf.oa.oa_buffer.gtt_offset |
>      > +                OA_MEM_SELECT_GGTT); /* head */
>      > +     I915_WRITE(GEN7_OABUFFER,
>      dev_priv->perf.oa.oa_buffer.gtt_offset);
>      > +     I915_WRITE(GEN7_OASTATUS1,
>      dev_priv->perf.oa.oa_buffer.gtt_offset |
>      > +                OABUFFER_SIZE_16M); /* tail */
>      > +
>      > +     /* On Haswell we have to track which OASTATUS1 flags we've
>      > +      * already seen since they can't be cleared while periodic
>      > +      * sampling is enabled.
>      > +      */
>      > +     dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
>      > +
>      > +     /* We have a sanity check in gen7_append_oa_reports() that
>      > +      * looks at the report-id field to make sure it's non-zero
>      > +      * which relies on the assumption that new reports are
>      > +      * being written to zeroed memory...
>      > +      */
>      > +     memset(dev_priv->perf.oa.oa_buffer.addr, 0, SZ_16M);
> 
>      You allocated zeroed memory.
> 
>    yup. currently I have this memset here because we may re-init the buffer
>    if the stream is disabled then re-enabled (via I915_PERF_IOCTL_ENABLE) or
>    if we have to reset the unit on error. In these cases there may be some
>    number of reports in the buffer with non-zero report-id fields while we
>    still want to be sure new reports are being written to zeroed memory so
>    that the sanity check that report-id != 0 will continue to be valid.
> 
>    I've had it in mind to consider optimizing this at some point to minimize
>    how much of the buffer is cleared, maybe just for the _DISABLE/_ENABLE
>    case where I'd expect the buffer will mostly be empty before disabling the
>    stream.

Or just make it clear that you are considering buffer reuse. Having the
memset here allows us to use non-shmemfs allocation, it wasn't that I
objected I just didn't understand the comment in the context of
allocation path.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 44+ messages in thread

* ✗ Fi.CI.BAT: warning for Enable Gen 7 Observation Architecture (rev3)
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (11 preceding siblings ...)
  2016-04-21 12:41 ` ✗ Fi.CI.BAT: failure " Patchwork
@ 2016-04-23  8:31 ` Patchwork
  2016-04-24 17:23 ` ✓ Fi.CI.BAT: success " Patchwork
  13 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2016-04-23  8:31 UTC (permalink / raw)
  To: Robert Bragg; +Cc: intel-gfx

== Series Details ==

Series: Enable Gen 7 Observation Architecture (rev3)
URL   : https://patchwork.freedesktop.org/series/3024/
State : warning

== Summary ==

Series 3024v3 Enable Gen 7 Observation Architecture
http://patchwork.freedesktop.org/api/1.0/series/3024/revisions/3/mbox/

Test pm_rpm:
        Subgroup basic-pci-d3-state:
                pass       -> SKIP       (bdw-nuci7)

bdw-nuci7        total:193  pass:180  dwarn:0   dfail:0   fail:0   skip:13 
bdw-ultra        total:193  pass:170  dwarn:0   dfail:0   fail:0   skip:23 
bsw-nuc-2        total:192  pass:153  dwarn:0   dfail:0   fail:0   skip:39 
byt-nuc          total:192  pass:154  dwarn:0   dfail:0   fail:0   skip:38 
ilk-hp8440p      total:193  pass:136  dwarn:0   dfail:0   fail:0   skip:57 
ivb-t430s        total:193  pass:165  dwarn:0   dfail:0   fail:0   skip:28 
skl-i7k-2        total:193  pass:168  dwarn:0   dfail:0   fail:0   skip:25 
skl-nuci5        total:193  pass:182  dwarn:0   dfail:0   fail:0   skip:11 
snb-dellxps      total:193  pass:155  dwarn:0   dfail:0   fail:0   skip:38 
snb-x220t        total:193  pass:155  dwarn:0   dfail:0   fail:1   skip:37 

Results at /archive/results/CI_IGT_test/Patchwork_2006/

340c485ad98d0ec0369a3b18d4a09938f3f5537d drm-intel-nightly: 2016y-04m-22d-17h-32m-25s UTC integration manifest
fcf0c57 drm/i915: Add more Haswell OA metric sets
b395ef1 drm/i915: add oa_event_min_timer_exponent sysctl
7f22bd8 drm/i915: Add dev.i915.perf_event_paranoid sysctl option
f62bcec drm/i915: advertise available metrics via sysfs
fe6778e drm/i915: Enable i915 perf stream for Haswell OA unit
e32d18d drm/i915: Add 'render basic' Haswell OA unit config
5b22605 drm/i915: don't whitelist oacontrol in cmd parser
2a3581f drm/i915: rename OACONTROL GEN7_OACONTROL
9563014 drm/i915: Add i915 perf infrastructure


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-21 16:15     ` Robert Bragg
@ 2016-04-23  8:48       ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2016-04-23  8:48 UTC (permalink / raw)
  To: Robert Bragg
  Cc: ML dri-devel, intel-gfx, Sourab Gupta, Deepak S, Daniel Vetter

On Thu, Apr 21, 2016 at 05:15:10PM +0100, Robert Bragg wrote:
>    On Wed, Apr 20, 2016 at 10:11 PM, Chris Wilson
>    <[1]chris@chris-wilson.co.uk> wrote:
> 
>      On Wed, Apr 20, 2016 at 03:23:10PM +0100, Robert Bragg wrote:
>      > +static void gen7_update_oacontrol_locked(struct drm_i915_private
>      *dev_priv)
>      > +{
>      > +     assert_spin_locked(&dev_priv->perf.hook_lock);
>      > +
>      > +     if (dev_priv->perf.oa.exclusive_stream->enabled) {
>      > +             unsigned long ctx_id = 0;
>      > +
>      > +             if (dev_priv->perf.oa.exclusive_stream->ctx)
>      > +                     ctx_id = dev_priv->perf.oa.specific_ctx_id;
>      > +
>      > +             if (dev_priv->perf.oa.exclusive_stream->ctx == NULL ||
>      ctx_id) {
>      > +                     bool periodic = dev_priv->perf.oa.periodic;
>      > +                     u32 period_exponent =
>      dev_priv->perf.oa.period_exponent;
>      > +                     u32 report_format =
>      dev_priv->perf.oa.oa_buffer.format;
>      > +
>      > +                     I915_WRITE(GEN7_OACONTROL,
>      > +                                (ctx_id & GEN7_OACONTROL_CTX_MASK) |
>      > +                                (period_exponent <<
>      > +                                 GEN7_OACONTROL_TIMER_PERIOD_SHIFT) |
>      > +                                (periodic ?
>      > +                                 GEN7_OACONTROL_TIMER_ENABLE : 0) |
>      > +                                (report_format <<
>      > +                                 GEN7_OACONTROL_FORMAT_SHIFT) |
>      > +                                (ctx_id ?
>      > +                                 GEN7_OACONTROL_PER_CTX_ENABLE : 0) |
>      > +                                GEN7_OACONTROL_ENABLE);
> 
>      So this works by only recording when the OACONTROL context address
>      matches the CCID.
> 
>      Rather than hooking into switch context and checking every batch whether
>      you have the exclusive context in case it changed address, you could
>      just pin the exclusive context when told by the user to bind perf to
>      that context. Then it will also have the same address until oa is
>      finished (and releases it vma pin).
> 
>    Yeah, this was the approach I first went with when the driver was perf
>    based, though we ended up deciding to got with hooking into pinning and
>    updating the OA state in the end.
> 
>    E.g. for reference:
>    [2]https://lists.freedesktop.org/archives/intel-gfx/2014-November/055385.html
>    (wow, sad face after seeing how long I've been kicking this stuff)
> 
>    I'd prefer to stick with this approach now, unless you see a big problem
>    with it.

Given no reason to have the hook, I don't see why we should. Pinning the
context in the GGTT and causing that bit of extra fragmentation isn't
the worst evil here, and it is better practice overall to treat the OA
register as holding the pin on the object it is referencing, along with
lifetime tracking of that register (i.e. unpinning only when we know it
has completed its writes). Given that, the pin_notify is inadequate.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
                     ` (7 preceding siblings ...)
  2016-04-20 23:16   ` Chris Wilson
@ 2016-04-23 10:34   ` Martin Peres
  2016-05-03 19:34     ` Robert Bragg
  8 siblings, 1 reply; 44+ messages in thread
From: Martin Peres @ 2016-04-23 10:34 UTC (permalink / raw)
  To: Robert Bragg, intel-gfx; +Cc: Deepak S, Daniel Vetter, Sourab Gupta, dri-devel

On 20/04/16 17:23, Robert Bragg wrote:
> Gen graphics hardware can be set up to periodically write snapshots of
> performance counters into a circular buffer via its Observation
> Architecture and this patch exposes that capability to userspace via the
> i915 perf interface.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Robert Bragg <robert@sixbynine.org>
> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>   drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>   drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
>   drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>   include/uapi/drm/i915_drm.h             |  70 ++-
>   5 files changed, 1408 insertions(+), 20 deletions(-)
>
> +
> +
> +	/* It takes a fairly long time for a new MUX configuration to
> +	 * be be applied after these register writes. This delay
> +	 * duration was derived empirically based on the render_basic
> +	 * config but hopefully it covers the maximum configuration
> +	 * latency...
> +	 */
> +	mdelay(100);

With such a HW and SW design, how can we ever hope to get any kind of
performance when we are trying to monitor different metrics on each
draw call? This may be acceptable for system monitoring, but it is
problematic for the GL extensions :s

Since it seems like we are going for a perf API, it means that for every
change of metrics, we need to flush the commands, wait for the GPU to be
done, then program the new set of metrics via an IOCTL, wait 100 ms, and
then we may resume rendering ... until the next change. We are talking
about a latency of 6-7 frames at 60 Hz here... this is non-negligible...
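
As a quick check of the frame count quoted above, here is a minimal
sketch in C. The function name frames_stalled is hypothetical and only
illustrates the arithmetic: a 100 ms stall at 60 Hz spans 6 whole frame
periods, before counting the extra time spent flushing and idling the GPU.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of whole display frames that elapse
 * during a stall of delay_ms milliseconds at refresh_hz.
 * Works in microseconds to avoid floating point.
 */
unsigned frames_stalled(unsigned delay_ms, unsigned refresh_hz)
{
	unsigned frame_us;

	assert(refresh_hz > 0 && refresh_hz <= 1000000u);

	frame_us = 1000000u / refresh_hz;	/* 16666 us at 60 Hz */
	return (delay_ms * 1000u) / frame_us;
}
```

Calling frames_stalled(100, 60) gives the whole-frame part of the stall;
the flush and GPU idle wait come on top of that.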

I understand that we have a ton of counters and we may hide latency by not
allowing the use of more than half of the counters for every draw call or
frame, but even then, this 100ms delay is killing this approach altogether.

To be honest, if it indeed is a HW bug, then the approach that Samuel
Pitoiset and I used for Nouveau, which involves pushing a handle
representing a pre-computed configuration to the command buffer so that a
software method can ask the kernel to reprogram the counters with as
little idle time as possible, would be useless, as waiting for the GPU to
be idle would usually not take more than a few ms... which is nothing
compared to waiting 100ms.

So, now, the elephant in the room: how can it take that long to apply the
change? Are the OA registers double buffered (NVIDIA's are, so that we can
reconfigure and start monitoring multiple counters at the same time)?

Maybe this 100ms is the polling period and the HW does not allow changing
the configuration in the middle of a polling session. In this case, this
delay should be dependent on the polling frequency. But even then, I would
really hope that the HW would allow us to tear down everything, reconfigure
and start polling again without waiting for the next tick. If not possible,
maybe we can change the frequency for the polling clock to make the polling
event happen sooner.

HW delays are usually a few microseconds, not milliseconds. That really
suggests that something funny is happening and the HW design is not
understood properly. If the documentation has nothing on this and the HW
teams cannot help, then I suggest a little REing session. I really want to
see this work land, but the way I see it right now is that we cannot rely
on it because of this bug. Maybe fixing this bug would require changing the
architecture, so better address it before landing the patches.

Worst case scenario, do not hesitate to contact me if none of the proposed
explanations pans out; I will take the time to read through the OA
material and try my REing skills on it. As I said, I really want to see
this upstream!

Sorry...

Martin
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* ✓ Fi.CI.BAT: success for Enable Gen 7 Observation Architecture (rev3)
  2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
                   ` (12 preceding siblings ...)
  2016-04-23  8:31 ` ✗ Fi.CI.BAT: warning " Patchwork
@ 2016-04-24 17:23 ` Patchwork
  13 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2016-04-24 17:23 UTC (permalink / raw)
  To: Robert Bragg; +Cc: intel-gfx

== Series Details ==

Series: Enable Gen 7 Observation Architecture (rev3)
URL   : https://patchwork.freedesktop.org/series/3024/
State : success

== Summary ==

Series 3024v3 Enable Gen 7 Observation Architecture
http://patchwork.freedesktop.org/api/1.0/series/3024/revisions/3/mbox/


byt-nuc          total:192  pass:154  dwarn:0   dfail:0   fail:0   skip:38 
ilk-hp8440p      total:193  pass:136  dwarn:0   dfail:0   fail:0   skip:57 
ivb-t430s        total:193  pass:165  dwarn:0   dfail:0   fail:0   skip:28 
snb-dellxps      total:193  pass:155  dwarn:0   dfail:0   fail:0   skip:38 

Results at /archive/results/CI_IGT_test/Patchwork_2048/

1e81bacf1f7fdbdf83f46b55389713fa13cb1256 drm-intel-nightly: 2016y-04m-24d-10h-36m-11s UTC integration manifest
78cae1b drm/i915: Add more Haswell OA metric sets
7866967 drm/i915: add oa_event_min_timer_exponent sysctl
f9d868d drm/i915: Add dev.i915.perf_event_paranoid sysctl option
030301f drm/i915: advertise available metrics via sysfs
440ee27 drm/i915: Enable i915 perf stream for Haswell OA unit
7b10cf5 drm/i915: Add 'render basic' Haswell OA unit config
1f4d04b drm/i915: don't whitelist oacontrol in cmd parser
9950d98 drm/i915: rename OACONTROL GEN7_OACONTROL
b4467d8 drm/i915: Add i915 perf infrastructure


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-04-23 10:34   ` Martin Peres
@ 2016-05-03 19:34     ` Robert Bragg
  2016-05-03 20:03       ` Robert Bragg
  2016-05-04  9:04       ` [Intel-gfx] " Martin Peres
  0 siblings, 2 replies; 44+ messages in thread
From: Robert Bragg @ 2016-05-03 19:34 UTC (permalink / raw)
  To: Martin Peres
  Cc: Deepak S, Daniel Vetter, intel-gfx, Sourab Gupta, ML dri-devel


Sorry for the delay replying to this, I missed it.

On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr> wrote:

> On 20/04/16 17:23, Robert Bragg wrote:
>
>> Gen graphics hardware can be set up to periodically write snapshots of
>> performance counters into a circular buffer via its Observation
>> Architecture and this patch exposes that capability to userspace via the
>> i915 perf interface.
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Robert Bragg <robert@sixbynine.org>
>> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>>   drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>>   drivers/gpu/drm/i915/i915_perf.c        | 940
>> +++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>>   include/uapi/drm/i915_drm.h             |  70 ++-
>>   5 files changed, 1408 insertions(+), 20 deletions(-)
>>
>> +
>> +
>> +       /* It takes a fairly long time for a new MUX configuration to
>> +        * be be applied after these register writes. This delay
>> +        * duration was derived empirically based on the render_basic
>> +        * config but hopefully it covers the maximum configuration
>> +        * latency...
>> +        */
>> +       mdelay(100);
>>
>
> With such a HW and SW design, how can we ever expose hope to get any
> kind of performance when we are trying to monitor different metrics on each
> draw call? This may be acceptable for system monitoring, but it is
> problematic
> for the GL extensions :s
>

> Since it seems like we are going for a perf API, it means that for every
> change
> of metrics, we need to flush the commands, wait for the GPU to be done,
> then
> program the new set of metrics via an IOCTL, wait 100 ms, and then we may
> resume rendering ... until the next change. We are talking about a latency
> of
> 6-7 frames at 60 Hz here... this is non-negligeable...
>

> I understand that we have a ton of counters and we may hide latency by not
> allowing using more than half of the counters for every draw call or
> frame, but
> even then, this 100ms delay is killing this approach altogether.
>


Although I'm also really unhappy about introducing this delay recently, the
impact of the delay is typically amortized somewhat by keeping a
configuration open as long as possible.

Even without this explicit delay here the OA unit isn't suited to being
reconfigured on a per draw call basis, though it is able to support per
draw call queries with the same config.

The above assessment assumes wanting to change config between draw calls
which is not something this driver aims to support - as the HW isn't really
designed for that model.

E.g. in the case of INTEL_performance_query, the backend keeps the i915
perf stream open until all OA based query objects are deleted - so you have
to be pretty explicit if you want to change config.

Considering the sets available on Haswell:
* Render Metrics Basic
* Compute Metrics Basic
* Compute Metrics Extended
* Memory Reads Distribution
* Memory Writes Distribution
* Metric set SamplerBalance

Each of these configs can expose around 50 counters as a set.

A GL application is most likely just going to use the render basic set, and
in the case of a tool like gputop/GPA, changing config would usually be
driven by some user interaction to select a set of metrics, where even a
100ms delay will go unnoticed.

In case you aren't familiar with how the GL_INTEL_performance_query side of
things works for OA counters; one thing to be aware of is that there's a
separate MI_REPORT_PERF_COUNT command that Mesa writes either side of a
query which writes all the counters for the current OA config (as
configured via this i915 perf interface) to a buffer. In addition to
collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the unit
for periodic sampling to be able to account for potential counter overflow.


It also might be worth keeping in mind that per draw queries will anyway
trash the pipelining of work, since it's necessary to put stalls between
the draw calls to avoid conflated metrics (not to do with the details of
this driver) so use cases will probably be limited to those that just want
the draw call numbers but don't mind ruining overall frame/application
performance. Periodic sampling or swap-to-swap queries would be better
suited to cases that should minimize their impact.


>
> To be honest, if it indeed is an HW bug, then the approach that Samuel
> Pitoiset
> and I used for Nouveau involving pushing an handle representing a
> pre-computed configuration to the command buffer so as a software method
> can be ask the kernel to reprogram the counters with as little idle time as
> possible, would be useless as waiting for the GPU to be idle would usually
> not
> take more than a few ms... which is nothing compared to waiting 100ms.
>

Yeah, I think this is a really quite different programming model to what
the OA unit is geared for, even if we can somehow knock out this 100ms MUX
config delay.


>
> So, now, the elephant in the room, how can it take that long to apply the
> change? Are the OA registers double buffered (NVIDIA's are, so as we can
> reconfigure and start monitoring multiple counters at the same time)?
>

Based on my understanding of how the HW works internally I can see how some
delay would be expected, but can't currently fathom why it would need to
have this order of magnitude, and so the delay is currently simply based on
experimentation where I was getting unit test failures at 10ms, for invalid
looking reports, but the tests ran reliably at 100ms.

OA configuration state isn't double buffered to allow configuration while
in use.



>
> Maybe this 100ms is the polling period and the HW does not allow changing
> the configuration in the middle of a polling session. In this case, this
> delay
> should be dependent on the polling frequency. But even then, I would really
> hope that the HW would allow us to tear down everything, reconfigure and
> start polling again without waiting for the next tick. If not possible,
> maybe we
> can change the frequency for the polling clock to make the polling event
> happen
> sooner.
>

The tests currently test periods from 160ns to 168 milliseconds while the
delay required falls somewhere between 10 and 100 milliseconds. I think I'd
expect the delay to be > all periods tested if this was the link.
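
The two endpoints quoted above can be reproduced with a small sketch. It
assumes the Haswell OA timer period is 80ns * 2^(exponent + 1); that
relation and the 80 ns base tick are assumptions not stated in this
thread, although they do match the 160ns and 168ms figures for exponents
0 and 20. The names OA_TICK_NS and oa_period_ns are illustrative only;
the limit of 31 is the maximum exponent given in the cover letter.

```c
#include <assert.h>
#include <stdint.h>

#define OA_TICK_NS	80u	/* assumed Haswell base tick, in ns */
#define OA_EXPONENT_MAX	31u	/* maximum exponent per the cover letter */

/*
 * Hypothetical helper: periodic sampling interval in nanoseconds for a
 * given OA timer exponent, assuming period = 80ns * 2^(exponent + 1).
 */
uint64_t oa_period_ns(unsigned exponent)
{
	assert(exponent <= OA_EXPONENT_MAX);

	return (uint64_t)OA_TICK_NS << (exponent + 1);
}
```

Under that assumption, exponent 0 yields 160 ns and exponent 20 yields
167772160 ns, i.e. roughly 168 ms, so the tested range spans periods both
shorter and longer than the 10-100 ms window in which the delay falls.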

Generally this seems unlikely to me, in part considering how the MUX isn't
really part of the OA unit that handles periodic sampling. I wouldn't rule
out some interaction though so some experimenting along these lines could
be interesting.


>
> HW delays are usually a few microseconds, not milliseconds, that really
> suggests
> that something funny is happening and the HW design is not understood
> properly.
>

Yup.

Although I understand more about the HW than I can write up here, I can't
currently see why the HW should ever really take this long to apply a MUX
config - although I can see why some delay would be required.

It's on my list of things to try and get feedback/ideas on from the OA
architect/HW engineers. I brought this up briefly some time ago but we
didn't have time to go into details.



> If the documentation has nothing on this and the HW teams cannot help,
> then I
> suggest a little REing session


There's no precisely documented delay requirement. Insofar as REing is the
process of inferring how black box HW works through poking it with a stick
and seeing how it reacts, then yep more of that may be necessary. At least
in this case the HW isn't really a black box (maybe stained glass), since I
hopefully have a fairly good sense of how the HW is designed and can prod
folks closer to the HW for feedback/ideas.

So far I haven't spent too long investigating this besides recently homing
in on needing a delay here when my unit tests were failing.


> I really want to see this work land, but the way I see
> it right now is that we cannot rely on it because of this bug. Maybe
> fixing this bug
> would require changing the architecture, so better address it before
> landing the
> patches.
>

I think it's unlikely to change the architecture; rather we might find
some other things to frob that make the MUX config apply faster (e.g. a
clock gating issue), find a way to get explicit feedback of completion so
we can minimize the delay, or gain a better understanding that lets us
choose a shorter delay in most cases.

The driver is already usable with gputop with this delay and considering
how config changes are typically associated with user interaction I
wouldn't see this as a show stopper - even though it's not ideal. I think
the assertions about it being unusable with GL were a little overstated,
being based on making frequent OA config changes, which is not really how
the interface is intended to be used.


Thanks for starting to take a look through the code.

Kind Regards,
- Robert


> Worst case scenario, do not hesitate to contact me if non of the proposed
> explanation pans out, I will take the time to read through the OA material
> and try my
> REing skills on it. As I said, I really want to see this upstream!
>

> Sorry...
>
> Martin
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-03 19:34     ` Robert Bragg
@ 2016-05-03 20:03       ` Robert Bragg
  2016-05-04  9:09         ` Martin Peres
  2016-05-04  9:04       ` [Intel-gfx] " Martin Peres
  1 sibling, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-05-03 20:03 UTC (permalink / raw)
  To: Martin Peres
  Cc: Deepak S, Daniel Vetter, intel-gfx, Sourab Gupta, ML dri-devel


On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org> wrote:

> Sorry for the delay replying to this, I missed it.
>
> On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr>
> wrote:
>
>> On 20/04/16 17:23, Robert Bragg wrote:
>>
>>> Gen graphics hardware can be set up to periodically write snapshots of
>>> performance counters into a circular buffer via its Observation
>>> Architecture and this patch exposes that capability to userspace via the
>>> i915 perf interface.
>>>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Signed-off-by: Robert Bragg <robert@sixbynine.org>
>>> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>>>   drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>>>   drivers/gpu/drm/i915/i915_perf.c        | 940
>>> +++++++++++++++++++++++++++++++-
>>>   drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>>>   include/uapi/drm/i915_drm.h             |  70 ++-
>>>   5 files changed, 1408 insertions(+), 20 deletions(-)
>>>
>>> +
>>> +
>>> +       /* It takes a fairly long time for a new MUX configuration to
>>> +        * be be applied after these register writes. This delay
>>> +        * duration was derived empirically based on the render_basic
>>> +        * config but hopefully it covers the maximum configuration
>>> +        * latency...
>>> +        */
>>> +       mdelay(100);
>>>
>>
>> With such a HW and SW design, how can we ever expose hope to get any
>> kind of performance when we are trying to monitor different metrics on
>> each
>> draw call? This may be acceptable for system monitoring, but it is
>> problematic
>> for the GL extensions :s
>>
>
>> Since it seems like we are going for a perf API, it means that for every
>> change
>> of metrics, we need to flush the commands, wait for the GPU to be done,
>> then
>> program the new set of metrics via an IOCTL, wait 100 ms, and then we may
>> resume rendering ... until the next change. We are talking about a
>> latency of
>> 6-7 frames at 60 Hz here... this is non-negligeable...
>>
>
>> I understand that we have a ton of counters and we may hide latency by not
>> allowing using more than half of the counters for every draw call or
>> frame, but
>> even then, this 100ms delay is killing this approach altogether.
>>
>
>
>
So revisiting this to double check how things fail with my latest
driver/tests without the delay, I apparently can't reproduce test failures
without the delay any more...

I think the explanation is that since first adding the delay to the driver
I also made the driver a bit more careful to not forward spurious
reports that look invalid due to a zeroed report id field, and that
mechanism keeps the unit tests happy, even though there are still some
number of invalid reports generated if we don't wait.

One problem with simply having no delay is that the driver prints an error
if it sees an invalid report so I get a lot of 'Skipping spurious, invalid
OA report' dmesg spam. Also this was intended more as a last resort
mechanism, and I wouldn't feel too happy about squashing the error message
and potentially sweeping other error cases under the carpet.
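
The report-id sanity check described above can be sketched roughly as
follows. This is a standalone illustration, not the driver code: the
function name count_valid_reports, the flat array layout, and the
assumption that the report id is the first dword of each report are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: walk n_reports fixed-size reports stored back to
 * back in buf and count those that would be forwarded, skipping any
 * whose first dword (assumed here to be the report id) is zero.
 */
size_t count_valid_reports(const uint32_t *buf, size_t n_reports,
			   size_t dwords_per_report)
{
	size_t valid = 0;
	size_t i;

	assert(dwords_per_report > 0);

	for (i = 0; i < n_reports; i++) {
		const uint32_t *report = buf + i * dwords_per_report;

		if (report[0] == 0)
			continue;	/* spurious: skip instead of forwarding */

		valid++;
	}

	return valid;
}
```

A filter of this kind hides not-yet-landed reports from userspace, which
is why it can mask the missing delay in the tests while the underlying
invalid reports are still being generated.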

Experimenting to see if the delay can at least be reduced, I brought the
delay up in millisecond increments and found that although I still see a
lot of spurious reports only waiting 1 or 5 milliseconds, at 10
milliseconds it's reduced quite a bit and at 15 milliseconds I don't seem to
have any errors.

15 milliseconds is still a long time, but at least not as long as 100.

Regards,
- Robert


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-03 19:34     ` Robert Bragg
  2016-05-03 20:03       ` Robert Bragg
@ 2016-05-04  9:04       ` Martin Peres
  2016-05-04 11:15         ` Robert Bragg
  1 sibling, 1 reply; 44+ messages in thread
From: Martin Peres @ 2016-05-04  9:04 UTC (permalink / raw)
  To: Robert Bragg, Martin Peres
  Cc: Deepak S, Daniel Vetter, intel-gfx, Sourab Gupta, ML dri-devel

On 03/05/16 22:34, Robert Bragg wrote:
> Sorry for the delay replying to this, I missed it.

No worries!

>
> On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr
> <mailto:martin.peres@free.fr>> wrote:
>
>     On 20/04/16 17:23, Robert Bragg wrote:
>
>         Gen graphics hardware can be set up to periodically write
>         snapshots of
>         performance counters into a circular buffer via its Observation
>         Architecture and this patch exposes that capability to userspace
>         via the
>         i915 perf interface.
>
>         Cc: Chris Wilson <chris@chris-wilson.co.uk
>         <mailto:chris@chris-wilson.co.uk>>
>         Signed-off-by: Robert Bragg <robert@sixbynine.org
>         <mailto:robert@sixbynine.org>>
>         Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
>         <mailto:zhenyuw@linux.intel.com>>
>         ---
>           drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>           drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>           drivers/gpu/drm/i915/i915_perf.c        | 940
>         +++++++++++++++++++++++++++++++-
>           drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>           include/uapi/drm/i915_drm.h             |  70 ++-
>           5 files changed, 1408 insertions(+), 20 deletions(-)
>
>         +
>         +
>         +       /* It takes a fairly long time for a new MUX
>         configuration to
>         +        * be be applied after these register writes. This delay
>         +        * duration was derived empirically based on the
>         render_basic
>         +        * config but hopefully it covers the maximum configuration
>         +        * latency...
>         +        */
>         +       mdelay(100);
>
>
>     With such a HW and SW design, how can we ever expose hope to get any
>     kind of performance when we are trying to monitor different metrics
>     on each
>     draw call? This may be acceptable for system monitoring, but it is
>     problematic
>     for the GL extensions :s
>
>
>     Since it seems like we are going for a perf API, it means that for
>     every change
>     of metrics, we need to flush the commands, wait for the GPU to be
>     done, then
>     program the new set of metrics via an IOCTL, wait 100 ms, and then
>     we may
>     resume rendering ... until the next change. We are talking about a
>     latency of
>     6-7 frames at 60 Hz here... this is non-negligeable...
>
>
>     I understand that we have a ton of counters and we may hide latency
>     by not
>     allowing using more than half of the counters for every draw call or
>     frame, but
>     even then, this 100ms delay is killing this approach altogether.
>
>
>
> Although I'm also really unhappy about introducing this delay recently,
> the impact of the delay is typically amortized somewhat by keeping a
> configuration open as long as possible.
>
> Even without this explicit delay here the OA unit isn't suited to being
> reconfigured on a per draw call basis, though it is able to support per
> draw call queries with the same config.
>
> The above assessment assumes wanting to change config between draw calls
> which is not something this driver aims to support - as the HW isn't
> really designed for that model.
>
> E.g. in the case of INTEL_performance_query, the backend keeps the i915
> perf stream open until all OA based query objects are deleted - so you
> have to be pretty explicit if you want to change config.

OK, I get your point. However, I still want to state that applications
changing the set would see a disastrous effect, as 100 ms is enough to
downclock both the CPU and GPU and that would dramatically alter the
metrics. Should we make it clear somewhere, either in
INTEL_performance_query or as a warning in mesa_performance when changing
the set while running? I would think the latter would be preferable as
it could also cover the case of the AMD extension which, IIRC, does not
talk about the performance cost of changing the metrics. With this
caveat made clear, it seems reasonable.

>
> Considering the sets available on Haswell:
> * Render Metrics Basic
> * Compute Metrics Basic
> * Compute Metrics Extended
> * Memory Reads Distribution
> * Memory Writes Distribution
> * Metric set SamplerBalance
>
> Each of these configs can expose around 50 counters as a set.
>
> A GL application is most likely just going to use the render basic set,
> and In the case of a tool like gputop/GPA then changing config would
> usually be driven by some user interaction to select a set of metrics,
> where even a 100ms delay will go unnoticed.

100 ms is becoming visible, but I agree, it would not be a show stopper 
for sure.

On the APITRACE side, this should not be an issue, because we do not 
change the set of metrics while running.

>
> In case you aren't familiar with how the GL_INTEL_performance_query side
> of things works for OA counters; one thing to be aware of is that
> there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
> side of a query which writes all the counters for the current OA config
> (as configured via this i915 perf interface) to a buffer. In addition to
> collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the
> unit for periodic sampling to be able to account for potential counter
> overflow.

Oh, the overflow case is mean. Doesn't the spec mandate the application 
to read at least every second? This is the case for the timestamp queries.

>
>
> It also might be worth keeping in mind that per draw queries will anyway
> trash the pipelining of work, since it's necessary to put stalls between
> the draw calls to avoid conflated metrics (not to do with the details of
> this driver) so use cases will probably be limited to those that just
> want the draw call numbers but don't mind ruining overall
> frame/application performance. Periodic sampling or swap-to-swap queries
> would be better suited to cases that should minimize their impact.

Yes, I agree that there will always be a cost, but with the design
implemented in nouveau (which barely involves the CPU at all), the
pipelining is almost unaffected. As in, monitoring every draw call with
a different metric would lower the performance of glxgears (worst case I
could think of) but still keep thousands of FPS.
>
>
>
>     To be honest, if it indeed is an HW bug, then the approach that
>     Samuel Pitoiset
>     and I used for Nouveau involving pushing an handle representing a
>     pre-computed configuration to the command buffer so as a software method
>     can be ask the kernel to reprogram the counters with as little idle
>     time as
>     possible, would be useless as waiting for the GPU to be idle would
>     usually not
>     take more than a few ms... which is nothing compared to waiting 100ms.
>
>
> Yeah, I think this is a really quite different programming model to what
> the OA unit is geared for, even if we can somehow knock out this 100ms
> MUX config delay.

Too bad :)

>
>
>
>     So, now, the elephant in the room, how can it take that long to
>     apply the
>     change? Are the OA registers double buffered (NVIDIA's are, so as we can
>     reconfigure and start monitoring multiple counters at the same time)?
>
>
> Based on my understanding of how the HW works internally I can see how
> some delay would be expected, but can't currently fathom why it would
> need to have this order of magnitude, and so the delay is currently
> simply based on experimentation where I was getting unit test failures
> at 10ms, for invalid looking reports, but the tests ran reliably at 100ms.
>
> OA configuration state isn't double buffered to allow configuration
> while in use.
>
>
>
>
>     Maybe this 100ms is the polling period and the HW does not allow
>     changing
>     the configuration in the middle of a polling session. In this case,
>     this delay
>     should be dependent on the polling frequency. But even then, I would
>     really
>     hope that the HW would allow us to tear down everything, reconfigure and
>     start polling again without waiting for the next tick. If not
>     possible, maybe we
>     can change the frequency for the polling clock to make the polling
>     event happen
>     sooner.
>
>
> The tests currently test periods from 160ns to 168 milliseconds while
> the delay required falls somewhere between 10 and 100 milliseconds. I
> think I'd expect the delay to be > all periods tested if this was the link.

Thanks, definitely the kind of information that is valuable for 
understanding this issue!

>
> Generally this seems unlikely to me, in part considering how the MUX
> isn't really part of the OA unit that handles periodic sampling. I
> wouldn't rule out some interaction though so some experimenting along
> these lines could be interesting.

That indeed makes it less likely. Interactions increase the BOM!

>
>
>
>     HW delays are usually a few microseconds, not milliseconds, that
>     really suggests
>     that something funny is happening and the HW design is not
>     understood properly.
>
>
> Yup.
>
> Although I understand more about the HW than I can write up here, I
> can't currently see why the HW should ever really take this long to
> apply a MUX config - although I can see why some delay would be required.
>
> It's on my list of things to try and get feedback/ideas on from the OA
> architect/HW engineers. I brought this up briefly some time ago but we
> didn't have time to go into details.

Sounds like a good idea!

>
>
>
>     If the documentation has nothing on this and the HW teams cannot
>     help, then I
>     suggest a little REing session
>
>
> There's no precisely documented delay requirement. Insofar as REing is
> the process of inferring how black box HW works through poking it with a
> stick and seeing how it reacts, then yep more of that may be necessary.
> At least in this case the HW isn't really a black box (maybe stained
> glass), where I hopefully have a fairly good sense of how the HW is
> designed and can prod folks closer to the HW for feedback/ideas.
>
> So far I haven't spent too long investigating this besides recently
> homing in on needing a delay here when my unit tests were failing.

ACK! Thanks for the info!

>
>
>     I really want to see this work land, but the way I see
>     it right now is that we cannot rely on it because of this bug. Maybe
>     fixing this bug
>     would require changing the architecture, so better address it before
>     landing the
>     patches.
>
>
> I think it's unlikely to change the architecture; rather we might just
> find some other things to frob that make the MUX config apply faster
> (e.g. clock gating issue); we find a way to get explicit feedback of
> completion so we can minimize the delay or a better understanding that
> lets us choose a shorter delay in most cases.

Yes, clock gating may be one issue here, even though it would be a funny 
hw design to clock gate the bus to a register...

>
> The driver is already usable with gputop with this delay and considering
> how config changes are typically associated with user interaction I
> wouldn't see this as a show stopper - even though it's not ideal. I
> think the assertions about it being unusable with GL, were a little
> overstated based on making frequent OA config changes which is not
> really how the interface is intended to be used.

Yeah, but with a performance warning in mesa, I would be OK with this change.
Thanks for taking the time to explain!
>
>
> Thanks for starting to take a look through the code.
>
> Kind Regards,
> - Robert

Martin
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-03 20:03       ` Robert Bragg
@ 2016-05-04  9:09         ` Martin Peres
  2016-05-04  9:49           ` Robert Bragg
  0 siblings, 1 reply; 44+ messages in thread
From: Martin Peres @ 2016-05-04  9:09 UTC (permalink / raw)
  To: Robert Bragg, Martin Peres
  Cc: Deepak S, Daniel Vetter, intel-gfx, Sourab Gupta, ML dri-devel

On 03/05/16 23:03, Robert Bragg wrote:
>
>
> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
> <mailto:robert@sixbynine.org>> wrote:
>
>     Sorry for the delay replying to this, I missed it.
>
>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr
>     <mailto:martin.peres@free.fr>> wrote:
>
>         On 20/04/16 17:23, Robert Bragg wrote:
>
>             Gen graphics hardware can be set up to periodically write
>             snapshots of
>             performance counters into a circular buffer via its Observation
>             Architecture and this patch exposes that capability to
>             userspace via the
>             i915 perf interface.
>
>             Cc: Chris Wilson <chris@chris-wilson.co.uk
>             <mailto:chris@chris-wilson.co.uk>>
>             Signed-off-by: Robert Bragg <robert@sixbynine.org
>             <mailto:robert@sixbynine.org>>
>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
>             <mailto:zhenyuw@linux.intel.com>>
>             ---
>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>               drivers/gpu/drm/i915/i915_perf.c        | 940
>             +++++++++++++++++++++++++++++++-
>               drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>               include/uapi/drm/i915_drm.h             |  70 ++-
>               5 files changed, 1408 insertions(+), 20 deletions(-)
>
>             +
>             +
>             +       /* It takes a fairly long time for a new MUX
>             configuration to
>             +        * be applied after these register writes. This delay
>             +        * duration was derived empirically based on the
>             render_basic
>             +        * config but hopefully it covers the maximum
>             configuration
>             +        * latency...
>             +        */
>             +       mdelay(100);
>
>
>         With such a HW and SW design, how can we ever hope to get any
>         kind of performance when we are trying to monitor different
>         metrics on each
>         draw call? This may be acceptable for system monitoring, but it
>         is problematic
>         for the GL extensions :s
>
>
>         Since it seems like we are going for a perf API, it means that
>         for every change
>         of metrics, we need to flush the commands, wait for the GPU to
>         be done, then
>         program the new set of metrics via an IOCTL, wait 100 ms, and
>         then we may
>         resume rendering ... until the next change. We are talking about
>         a latency of
>         6-7 frames at 60 Hz here... this is non-negligible...
>
>
>         I understand that we have a ton of counters and we may hide
>         latency by not
>         allowing using more than half of the counters for every draw
>         call or frame, but
>         even then, this 100ms delay is killing this approach altogether.
>
>
>
>
> So revisiting this to double check how things fail with my latest
> driver/tests without the delay, I apparently can't reproduce test
> failures without the delay any more...
>
> I think the explanation is that since first adding the delay to the
> driver I also made the driver a bit more careful to not forward
> spurious reports that look invalid due to a zeroed report id field, and
> that mechanism keeps the unit tests happy, even though there are still
> some number of invalid reports generated if we don't wait.
>
> One problem with simply having no delay is that the driver prints an
> error if it sees an invalid report so I get a lot of 'Skipping
> spurious, invalid OA report' dmesg spam. Also this was intended more as
> a last resort mechanism, and I wouldn't feel too happy about squashing
> the error message and potentially sweeping other error cases under the
> carpet.
>
> Experimenting to see if the delay can at least be reduced, I brought the
> delay up in millisecond increments and found that although I still see a
> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> milliseconds it's reduced quite a bit and at 15 milliseconds I don't seem
> to have any errors.
>
> 15 milliseconds is still a long time, but at least not as long as 100.

OK, so the issue does not come from the HW after all, great!

Now, my main question is, why are spurious events generated when 
changing the MUX's value? I can understand that we would need to ignore 
the reading that came right after the change, but other than this,  I am 
a bit at a loss.

I am a bit swamped with other tasks right now, but I would love to spend 
more time reviewing your code as I really want to see this upstream!

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04  9:09         ` Martin Peres
@ 2016-05-04  9:49           ` Robert Bragg
  2016-05-04 12:24             ` Daniel Vetter
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bragg @ 2016-05-04  9:49 UTC (permalink / raw)
  To: Martin Peres
  Cc: intel-gfx, Sourab Gupta, ML dri-devel, Deepak S, Daniel Vetter


On Wed, May 4, 2016 at 10:09 AM, Martin Peres <martin.peres@linux.intel.com>
wrote:

> On 03/05/16 23:03, Robert Bragg wrote:
>
>>
>>
>> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
>> <mailto:robert@sixbynine.org>> wrote:
>>
>>     Sorry for the delay replying to this, I missed it.
>>
>>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr
>>     <mailto:martin.peres@free.fr>> wrote:
>>
>>         On 20/04/16 17:23, Robert Bragg wrote:
>>
>>             Gen graphics hardware can be set up to periodically write
>>             snapshots of
>>             performance counters into a circular buffer via its
>> Observation
>>             Architecture and this patch exposes that capability to
>>             userspace via the
>>             i915 perf interface.
>>
>>             Cc: Chris Wilson <chris@chris-wilson.co.uk
>>             <mailto:chris@chris-wilson.co.uk>>
>>             Signed-off-by: Robert Bragg <robert@sixbynine.org
>>             <mailto:robert@sixbynine.org>>
>>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
>>             <mailto:zhenyuw@linux.intel.com>>
>>
>>             ---
>>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>>               drivers/gpu/drm/i915/i915_perf.c        | 940
>>             +++++++++++++++++++++++++++++++-
>>               drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>>               include/uapi/drm/i915_drm.h             |  70 ++-
>>               5 files changed, 1408 insertions(+), 20 deletions(-)
>>
>>             +
>>             +
>>             +       /* It takes a fairly long time for a new MUX
>>             configuration to
>>             +        * be applied after these register writes. This
>> delay
>>             +        * duration was derived empirically based on the
>>             render_basic
>>             +        * config but hopefully it covers the maximum
>>             configuration
>>             +        * latency...
>>             +        */
>>             +       mdelay(100);
>>
>>
>>         With such a HW and SW design, how can we ever hope to get
>> any
>>         kind of performance when we are trying to monitor different
>>         metrics on each
>>         draw call? This may be acceptable for system monitoring, but it
>>         is problematic
>>         for the GL extensions :s
>>
>>
>>         Since it seems like we are going for a perf API, it means that
>>         for every change
>>         of metrics, we need to flush the commands, wait for the GPU to
>>         be done, then
>>         program the new set of metrics via an IOCTL, wait 100 ms, and
>>         then we may
>>         resume rendering ... until the next change. We are talking about
>>         a latency of
>>         6-7 frames at 60 Hz here... this is non-negligible...
>>
>>
>>         I understand that we have a ton of counters and we may hide
>>         latency by not
>>         allowing using more than half of the counters for every draw
>>         call or frame, but
>>         even then, this 100ms delay is killing this approach altogether.
>>
>>
>>
>>
>> So revisiting this to double check how things fail with my latest
>> driver/tests without the delay, I apparently can't reproduce test
>> failures without the delay any more...
>>
>> I think the explanation is that since first adding the delay to the
>> driver I also made the driver a bit more careful to not forward
>> spurious reports that look invalid due to a zeroed report id field, and
>> that mechanism keeps the unit tests happy, even though there are still
>> some number of invalid reports generated if we don't wait.
>>
>> One problem with simply having no delay is that the driver prints an
>> error if it sees an invalid report so I get a lot of 'Skipping
>> spurious, invalid OA report' dmesg spam. Also this was intended more as
>> a last resort mechanism, and I wouldn't feel too happy about squashing
>> the error message and potentially sweeping other error cases under the
>> carpet.
>>
>> Experimenting to see if the delay can at least be reduced, I brought the
>> delay up in millisecond increments and found that although I still see a
>> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
>> milliseconds it's reduced quite a bit and at 15 milliseconds I don't seem
>> to have any errors.
>>
>> 15 milliseconds is still a long time, but at least not as long as 100.
>>
>
> OK, so the issue does not come from the HW after all, great!
>

Erm, I'm not sure that's a conclusion we can make here...

The upshot here was really just reducing the delay from 100ms to 15ms.
Previously I arrived at a workable delay by jumping the delay in orders of
magnitude with 10ms failing, 100ms passing and I didn't try and refine it
further. Here I've looked at delays between 10 and 100ms.

The other thing is observing that because the kernel is checking for
invalid reports (generated by the hardware) before forwarding to userspace
the lack of a delay no longer triggers i-g-t failures because the invalid
data won't reach i-g-t any more - though the invalid reports are still a
thing to avoid.


> Now, my main question is, why are spurious events generated when changing
> the MUX's value? I can understand that we would need to ignore the reading
> that came right after the change, but other than this,  I am a bit at a
> loss.
>

The MUX selects 16 signals that the OA unit can turn into 16 separate
counters by basically counting the signal changes. (there's some fancy
fixed function logic that can affect this but that's the general idea).

If the MUX is in the middle of being re-programmed then some subset of
those 16 signals are for who knows what.

After programming the MUX we will go on to configure the OA unit and the
tests will enable periodic sampling which (if we have no delay) will sample
the OA counters that are currently being fed by undefined signals.

So as far as that goes it makes sense to me to expect bad data if we don't
wait for the MUX config to land properly. Something I don't really know is
how come we're seemingly lucky to have the reports be cleanly invalid with
a zero report-id, instead of just having junk data that would be harder to
recognise.
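
To make the "zero report-id" check concrete, here is a minimal userspace
sketch of the idea (not the driver code itself): walk a circular buffer
of fixed-size reports from head to tail and only forward the ones whose
first 32-bit word is non-zero. The 256 byte report size, the position of
the report id and the function name are all assumptions made for the
example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: fixed-size reports with the report id in the first dword. */
#define OA_REPORT_SIZE 256u

/*
 * Hypothetical helper: copy valid reports between head and tail to 'out'.
 *
 * 'head' and 'tail' are byte offsets into 'oa_buf'; both must be multiples
 * of OA_REPORT_SIZE and less than 'buf_size', and 'buf_size' must itself be
 * a multiple of OA_REPORT_SIZE, otherwise the walk would never reach 'tail'.
 *
 * Returns the number of reports copied; '*skipped' counts the reports that
 * were dropped because their report id was zero.
 */
static size_t copy_valid_reports(const uint8_t *oa_buf, size_t buf_size,
				 size_t head, size_t tail,
				 uint8_t *out, size_t out_capacity,
				 size_t *skipped)
{
	size_t copied = 0;

	*skipped = 0;

	while (head != tail) {
		const uint8_t *report = oa_buf + head;
		uint32_t report_id;

		/* memcpy avoids any alignment assumptions about oa_buf */
		memcpy(&report_id, report, sizeof(report_id));

		if (report_id != 0) {
			/* Stop before consuming a report we have no room for */
			if ((copied + 1) * OA_REPORT_SIZE > out_capacity)
				break;
			memcpy(out + copied * OA_REPORT_SIZE, report,
			       OA_REPORT_SIZE);
			copied++;
		} else {
			/* Looks like a report that hasn't landed properly */
			(*skipped)++;
		}

		/* Advance to the next report, wrapping at the buffer end */
		head = (head + OA_REPORT_SIZE) % buf_size;
	}

	return copied;
}
```

A check like this only catches reports that happen to have a zeroed id,
so it is a last resort filter rather than a substitute for waiting until
the MUX config has landed.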


> I am a bit swamped with other tasks right now, but I would love to spend
> more time reviewing your code as I really want to see this upstream!
>

No worries.

I can hopefully send out my i-g-t tests this afternoon too, which should
give us all the pieces to be able to seriously consider landing things
soon.

Regards,
- Robert


> Martin
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Intel-gfx] [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04  9:04       ` [Intel-gfx] " Martin Peres
@ 2016-05-04 11:15         ` Robert Bragg
  0 siblings, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-05-04 11:15 UTC (permalink / raw)
  To: Martin Peres
  Cc: intel-gfx, Sourab Gupta, ML dri-devel, Deepak S, Daniel Vetter


On Wed, May 4, 2016 at 10:04 AM, Martin Peres <martin.peres@linux.intel.com>
wrote:

> On 03/05/16 22:34, Robert Bragg wrote:
>
>> Sorry for the delay replying to this, I missed it.
>>
>
> No worries!
>
>
>> On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr
>> <mailto:martin.peres@free.fr>> wrote:
>>
>>     On 20/04/16 17:23, Robert Bragg wrote:
>>
>>         Gen graphics hardware can be set up to periodically write
>>         snapshots of
>>         performance counters into a circular buffer via its Observation
>>         Architecture and this patch exposes that capability to userspace
>>         via the
>>         i915 perf interface.
>>
>>         Cc: Chris Wilson <chris@chris-wilson.co.uk
>>         <mailto:chris@chris-wilson.co.uk>>
>>         Signed-off-by: Robert Bragg <robert@sixbynine.org
>>         <mailto:robert@sixbynine.org>>
>>         Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
>>         <mailto:zhenyuw@linux.intel.com>>
>>
>>         ---
>>           drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>>           drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>>           drivers/gpu/drm/i915/i915_perf.c        | 940
>>         +++++++++++++++++++++++++++++++-
>>           drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>>           include/uapi/drm/i915_drm.h             |  70 ++-
>>           5 files changed, 1408 insertions(+), 20 deletions(-)
>>
>>         +
>>         +
>>         +       /* It takes a fairly long time for a new MUX
>>         configuration to
>>         +        * be applied after these register writes. This delay
>>         +        * duration was derived empirically based on the
>>         render_basic
>>         +        * config but hopefully it covers the maximum
>> configuration
>>         +        * latency...
>>         +        */
>>         +       mdelay(100);
>>
>>
>>     With such a HW and SW design, how can we ever hope to get any
>>     kind of performance when we are trying to monitor different metrics
>>     on each
>>     draw call? This may be acceptable for system monitoring, but it is
>>     problematic
>>     for the GL extensions :s
>>
>>
>>     Since it seems like we are going for a perf API, it means that for
>>     every change
>>     of metrics, we need to flush the commands, wait for the GPU to be
>>     done, then
>>     program the new set of metrics via an IOCTL, wait 100 ms, and then
>>     we may
>>     resume rendering ... until the next change. We are talking about a
>>     latency of
>>     6-7 frames at 60 Hz here... this is non-negligible...
>>
>>
>>     I understand that we have a ton of counters and we may hide latency
>>     by not
>>     allowing using more than half of the counters for every draw call or
>>     frame, but
>>     even then, this 100ms delay is killing this approach altogether.
>>
>>
>>
>> Although I'm also really unhappy about introducing this delay recently,
>> the impact of the delay is typically amortized somewhat by keeping a
>> configuration open as long as possible.
>>
>> Even without this explicit delay here the OA unit isn't suited to being
>> reconfigured on a per draw call basis, though it is able to support per
>> draw call queries with the same config.
>>
>> The above assessment assumes wanting to change config between draw calls
>> which is not something this driver aims to support - as the HW isn't
>> really designed for that model.
>>
>> E.g. in the case of INTEL_performance_query, the backend keeps the i915
>> perf stream open until all OA based query objects are deleted - so you
>> have to be pretty explicit if you want to change config.
>>
>
> OK, I get your point. However, I still want to state that applications
> changing the set would see a disastrous effect as 100 ms is enough to
> downclock both the CPU and GPU and that would dramatically alter the
> metrics. Should we make it clear somewhere, either in the
> INTEL_performance_query or as a warning in mesa_performance if changing the
> set while running? I would think the latter would be preferable as it could
> also cover the case of the AMD extension which, IIRC, does not talk about
> the performance cost of changing the metrics. With this caveat made clear,
> it seems reasonable.
>

Yeah a KHR_debug performance warning sounds like a good idea.


>
>
>> In case you aren't familiar with how the GL_INTEL_performance_query side
>> of things works for OA counters; one thing to be aware of is that
>> there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
>> side of a query which writes all the counters for the current OA config
>> (as configured via this i915 perf interface) to a buffer. In addition to
>> collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the
>> unit for periodic sampling to be able to account for potential counter
>> overflow.
>>
>
> Oh, the overflow case is mean. Doesn't the spec mandate the application to
> read at least every second? This is the case for the timestamp queries.
>

For a Haswell GT3 system with 40 EUs @ 1GHz some aggregate EU counters may
overflow their 32 bits in approximately 40 milliseconds. It should be pretty
unusual to see a draw call last that long, but not unimaginable. Might also
be a good draw call to focus on profiling too :-)

For Gen8+ a bunch of the A counters can be reported with 40 bits to mitigate
this issue.
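
As a back-of-the-envelope sketch of why the counter width matters: the
time before a counter wraps is 2^bits divided by the aggregate increment
rate. The rate is left as a parameter here because how many increments
per clock a given aggregate counter sees isn't spelled out in this
thread; the function name is made up for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: seconds until a counter of the given width wraps
 * when it is incremented 'increments_per_second' times a second.
 * Returns a negative value for arguments it can't handle.
 */
static double counter_wrap_seconds(unsigned int bits,
				   double increments_per_second)
{
	if (bits >= 64 || increments_per_second <= 0.0)
		return -1.0;

	/* 2^bits increments take the counter back round to zero */
	return (double)(UINT64_C(1) << bits) / increments_per_second;
}
```

For the same rate, going from 32 to 40 bits multiplies the wrap time by
256, and periodic sampling needs a period shorter than the wrap time for
overflows to be accounted for.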


>
>
>>
>> It also might be worth keeping in mind that per draw queries will anyway
>> trash the pipelining of work, since it's necessary to put stalls between
>> the draw calls to avoid conflated metrics (not to do with the details of
>> this driver) so use cases will probably be limited to those that just
>> want the draw call numbers but don't mind ruining overall
>> frame/application performance. Periodic sampling or swap-to-swap queries
>> would be better suited to cases that should minimize their impact.
>>
>
> Yes, I agree that there will always be a cost, but with the design
> implemented in nouveau (which barely involves the CPU at all), the
> pipelining is almost unaffected. As in, monitoring every draw call with a
> different metric would lower the performance of glxgears (worst case I
> could think of) but still keep thousands of FPS.
>

I guess it just has different trade offs.

While it sounds like we typically have a higher cost to reconfigure OA (at
least if touching the MUX), once the config is fixed (which can be done
before measuring anything) I guess the pipelining for queries might
be slightly better with MI_REPORT_PERF_COUNT commands than something
requiring interrupting + executing work on the cpu to switch config (even
if it's cheaper than an OA re-config). I guess nouveau would have the same
need to insert GPU pipeline stalls (just gpu syncing with gpu) to avoid
conflating neighbouring draw call metrics, and maybe the bubbles from those
can swallow the latency of the software methods.

glxgears might not really exaggerate draw call pipeline stall issues with
only 6 cheap primitives per gear. glxgears hammers context switching more
so than drawing anything. I think a pessimal case would be an app that
depends on large numbers of draw calls per frame that each do enough real
work that stalling for their completion is also measurable.

Funnily enough enabling the OA unit with glxgears can be kind of
problematic for Gen8+ which automatically writes reports on context switch
due to the spam of generating all of those context switch reports.


>
>> The driver is already usable with gputop with this delay and considering
>> how config changes are typically associated with user interaction I
>> wouldn't see this as a show stopper - even though it's not ideal. I
>> think the assertions about it being unusable with GL, were a little
>> overstated based on making frequent OA config changes which is not
>> really how the interface is intended to be used.
>>
>
> Yeah, but with a performance warning in mesa, I would be OK with this change.
> Thanks for taking the time to explain!


A performance warning sounds like a sensible idea yup.

Regards,
- Robert


>
>
>>
>> Thanks for starting to take a look through the code.
>>
>> Kind Regards,
>> - Robert
>>
>
> Martin
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04  9:49           ` Robert Bragg
@ 2016-05-04 12:24             ` Daniel Vetter
  2016-05-04 13:24               ` Robert Bragg
  0 siblings, 1 reply; 44+ messages in thread
From: Daniel Vetter @ 2016-05-04 12:24 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, ML dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> On Wed, May 4, 2016 at 10:09 AM, Martin Peres <martin.peres@linux.intel.com>
> wrote:
> 
> > On 03/05/16 23:03, Robert Bragg wrote:
> >
> >>
> >>
> >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
> >> <mailto:robert@sixbynine.org>> wrote:
> >>
> >>     Sorry for the delay replying to this, I missed it.
> >>
> >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@free.fr
> >>     <mailto:martin.peres@free.fr>> wrote:
> >>
> >>         On 20/04/16 17:23, Robert Bragg wrote:
> >>
> >>             Gen graphics hardware can be set up to periodically write
> >>             snapshots of
> >>             performance counters into a circular buffer via its
> >> Observation
> >>             Architecture and this patch exposes that capability to
> >>             userspace via the
> >>             i915 perf interface.
> >>
> >>             Cc: Chris Wilson <chris@chris-wilson.co.uk
> >>             <mailto:chris@chris-wilson.co.uk>>
> >>             Signed-off-by: Robert Bragg <robert@sixbynine.org
> >>             <mailto:robert@sixbynine.org>>
> >>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
> >>             <mailto:zhenyuw@linux.intel.com>>
> >>
> >>             ---
> >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> >>               drivers/gpu/drm/i915/i915_perf.c        | 940
> >>             +++++++++++++++++++++++++++++++-
> >>               drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
> >>               include/uapi/drm/i915_drm.h             |  70 ++-
> >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> >>
> >>             +
> >>             +
> >>             +       /* It takes a fairly long time for a new MUX
> >>             configuration to
> > >>             +        * be applied after these register writes. This
> >> delay
> >>             +        * duration was derived empirically based on the
> >>             render_basic
> >>             +        * config but hopefully it covers the maximum
> >>             configuration
> >>             +        * latency...
> >>             +        */
> >>             +       mdelay(100);
> >>
> >>
> > >>         With such a HW and SW design, how can we ever hope to get
> >> any
> >>         kind of performance when we are trying to monitor different
> >>         metrics on each
> >>         draw call? This may be acceptable for system monitoring, but it
> >>         is problematic
> >>         for the GL extensions :s
> >>
> >>
> >>         Since it seems like we are going for a perf API, it means that
> >>         for every change
> >>         of metrics, we need to flush the commands, wait for the GPU to
> >>         be done, then
> >>         program the new set of metrics via an IOCTL, wait 100 ms, and
> >>         then we may
> >>         resume rendering ... until the next change. We are talking about
> >>         a latency of
> > >>         6-7 frames at 60 Hz here... this is non-negligible...
> >>
> >>
> >>         I understand that we have a ton of counters and we may hide
> >>         latency by not
> >>         allowing using more than half of the counters for every draw
> >>         call or frame, but
> >>         even then, this 100ms delay is killing this approach altogether.
> >>
> >>
> >>
> >>
> >> So revisiting this to double check how things fail with my latest
> >> driver/tests without the delay, I apparently can't reproduce test
> >> failures without the delay any more...
> >>
> >> I think the explanation is that since first adding the delay to the
> > >> driver I also made the driver a bit more careful to not forward
> > >> spurious reports that look invalid due to a zeroed report id field, and
> > >> that mechanism keeps the unit tests happy, even though there are still
> > >> some number of invalid reports generated if we don't wait.
> > >>
> > >> One problem with simply having no delay is that the driver prints an
> > >> error if it sees an invalid report so I get a lot of 'Skipping
> >> spurious, invalid OA report' dmesg spam. Also this was intended more as
> >> a last resort mechanism, and I wouldn't feel too happy about squashing
> >> the error message and potentially sweeping other error cases under the
> >> carpet.
> >>
> >> Experimenting to see if the delay can at least be reduced, I brought the
> >> delay up in millisecond increments and found that although I still see a
> >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> > >> milliseconds it's reduced quite a bit and at 15 milliseconds I don't seem
> >> to have any errors.
> >>
> >> 15 milliseconds is still a long time, but at least not as long as 100.
> >>
> >
> > OK, so the issue does not come from the HW after all, great!
> >
> 
> Erm, I'm not sure that's a conclusion we can make here...
> 
> The upshot here was really just reducing the delay from 100ms to 15ms.
> Previously I arrived at a workable delay by jumping the delay in orders of
> magnitude with 10ms failing, 100ms passing and I didn't try and refine it
> further. Here I've looked at delays between 10 and 100ms.
> 
> The other thing is observing that because the kernel is checking for
> invalid reports (generated by the hardware) before forwarding to userspace
> the lack of a delay no longer triggers i-g-t failures because the invalid
> data won't reach i-g-t any more - though the invalid reports are still a
> thing to avoid.
> 
> 
> > Now, my main question is, why are spurious events generated when changing
> > the MUX's value? I can understand that we would need to ignore the reading
> > that came right after the change, but other than this,  I am a bit at a
> > loss.
> >
> 
> The MUX selects 16 signals that the OA unit can turn into 16 separate
> counters by basically counting the signal changes. (there's some fancy
> fixed function logic that can affect this but that's the general idea).
> 
> If the MUX is in the middle of being re-programmed then some subset of
> those 16 signals are for who knows what.
> 
> After programming the MUX we will go on to configure the OA unit and the
> tests will enable periodic sampling which (if we have no delay) will sample
> the OA counters that are currently being fed by undefined signals.
> 
> So as far as that goes it makes sense to me to expect bad data if we don't
> wait for the MUX config to land properly. Something I don't really know is
> how come we're seemingly lucky to have the reports be cleanly invalid with
> a zero report-id, instead of just having junk data that would be harder to
> recognise.

Yeah this mdelay story sounds really scary. A few random comments:
- msleep instead of mdelay please
- no dmesg noise above debug level for stuff that we know can happen -
  dmesg noise counts as igt failures
- reading 0 sounds more like bad synchronization. Have you tried quiescing
  the entire gpu (to make sure nothing is happen) and disabling OA, then
  updating, and then restarting? At least on a very quick look I didn't
  spot that. Random delays freak me out a bit, but wouldn't be surprised
  if really needed.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04 12:24             ` Daniel Vetter
@ 2016-05-04 13:24               ` Robert Bragg
  2016-05-04 13:33                 ` Robert Bragg
  2016-05-04 13:51                 ` Daniel Vetter
  0 siblings, 2 replies; 44+ messages in thread
From: Robert Bragg @ 2016-05-04 13:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-gfx, ML dri-devel, Sourab Gupta, Deepak S, Daniel Vetter


[-- Attachment #1.1: Type: text/plain, Size: 10399 bytes --]

On Wed, May 4, 2016 at 1:24 PM, Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> > On Wed, May 4, 2016 at 10:09 AM, Martin Peres <
> martin.peres@linux.intel.com>
> > wrote:
> >
> > > On 03/05/16 23:03, Robert Bragg wrote:
> > >
> > >>
> > >>
> > >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
> > >> <mailto:robert@sixbynine.org>> wrote:
> > >>
> > >>     Sorry for the delay replying to this, I missed it.
> > >>
> > >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <
> martin.peres@free.fr
> > >>     <mailto:martin.peres@free.fr>> wrote:
> > >>
> > >>         On 20/04/16 17:23, Robert Bragg wrote:
> > >>
> > >>             Gen graphics hardware can be set up to periodically write
> > >>             snapshots of
> > >>             performance counters into a circular buffer via its
> > >> Observation
> > >>             Architecture and this patch exposes that capability to
> > >>             userspace via the
> > >>             i915 perf interface.
> > >>
> > >>             Cc: Chris Wilson <chris@chris-wilson.co.uk
> > >>             <mailto:chris@chris-wilson.co.uk>>
> > >>             Signed-off-by: Robert Bragg <robert@sixbynine.org
> > >>             <mailto:robert@sixbynine.org>>
> > >>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
> > >>             <mailto:zhenyuw@linux.intel.com>>
> > >>
> > >>             ---
> > >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> > >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> > >>               drivers/gpu/drm/i915/i915_perf.c        | 940
> > >>             +++++++++++++++++++++++++++++++-
> > >>               drivers/gpu/drm/i915/i915_reg.h         | 338
> ++++++++++++
> > >>               include/uapi/drm/i915_drm.h             |  70 ++-
> > >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> > >>
> > >>             +
> > >>             +
> > >>             +       /* It takes a fairly long time for a new MUX
> > >>             configuration to
> > >>             +        * be be applied after these register writes. This
> > >> delay
> > >>             +        * duration was derived empirically based on the
> > >>             render_basic
> > >>             +        * config but hopefully it covers the maximum
> > >>             configuration
> > >>             +        * latency...
> > >>             +        */
> > >>             +       mdelay(100);
> > >>
> > >>
> > >>         With such a HW and SW design, how can we ever expose hope to
> get
> > >> any
> > >>         kind of performance when we are trying to monitor different
> > >>         metrics on each
> > >>         draw call? This may be acceptable for system monitoring, but
> it
> > >>         is problematic
> > >>         for the GL extensions :s
> > >>
> > >>
> > >>         Since it seems like we are going for a perf API, it means that
> > >>         for every change
> > >>         of metrics, we need to flush the commands, wait for the GPU to
> > >>         be done, then
> > >>         program the new set of metrics via an IOCTL, wait 100 ms, and
> > >>         then we may
> > >>         resume rendering ... until the next change. We are talking
> about
> > >>         a latency of
> > >>         6-7 frames at 60 Hz here... this is non-negligeable...
> > >>
> > >>
> > >>         I understand that we have a ton of counters and we may hide
> > >>         latency by not
> > >>         allowing using more than half of the counters for every draw
> > >>         call or frame, but
> > >>         even then, this 100ms delay is killing this approach
> altogether.
> > >>
> > >>
> > >>
> > >>
> > >> So revisiting this to double check how things fail with my latest
> > >> driver/tests without the delay, I apparently can't reproduce test
> > >> failures without the delay any more...
> > >>
> > >> I think the explanation is that since first adding the delay to the
> > >> driver I also made the the driver a bit more careful to not forward
> > >> spurious reports that look invalid due to a zeroed report id field,
> and
> > >> that mechanism keeps the unit tests happy, even though there are still
> > >> some number of invalid reports generated if we don't wait.
> > >>
> > >> One problem with simply having no delay is that the driver prints an
> > >> error if it sees an invalid reports so I get a lot of 'Skipping
> > >> spurious, invalid OA report' dmesg spam. Also this was intended more
> as
> > >> a last resort mechanism, and I wouldn't feel too happy about squashing
> > >> the error message and potentially sweeping other error cases under the
> > >> carpet.
> > >>
> > >> Experimenting to see if the delay can at least be reduced, I brought
> the
> > >> delay up in millisecond increments and found that although I still
> see a
> > >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> > >> milliseconds its reduced quite a bit and at 15 milliseconds I don't
> seem
> > >> to have any errors.
> > >>
> > >> 15 milliseconds is still a long time, but at least not as long as 100.
> > >>
> > >
> > > OK, so the issue does not come from the HW after all, great!
> > >
> >
> > Erm, I'm not sure that's a conclusion we can make here...
> >
> > The upshot here was really just reducing the delay from 100ms to 15ms.
> > Previously I arrived at a workable delay by jumping the delay in orders
> of
> > magnitude with 10ms failing, 100ms passing and I didn't try and refine it
> > further. Here I've looked at delays between 10 and 100ms.
> >
> > The other thing is observing that because the kernel is checking for
> > invalid reports (generated by the hardware) before forwarding to
> userspace
> > the lack of a delay no longer triggers i-g-t failures because the invalid
> > data won't reach i-g-t any more - though the invalid reports are still a
> > thing to avoid.
> >
> >
> > > Now, my main question is, why are spurious events generated when
> changing
> > > the MUX's value? I can understand that we would need to ignore the
> reading
> > > that came right after the change, but other than this,  I am a bit at a
> > > loss.
> > >
> >
> > The MUX selects 16 signals that the OA unit can turn into 16 separate
> > counters by basically counting the signal changes. (there's some fancy
> > fixed function logic that can affect this but that's the general idea).
> >
> > If the MUX is in the middle of being re-programmed then some subset of
> > those 16 signals are for who knows what.
> >
> > After programming the MUX we will go on to configure the OA unit and the
> > tests will enable periodic sampling which (if we have no delay) will
> sample
> > the OA counters that are currently being fed by undefined signals.
> >
> > So as far as that goes it makes sense to me to expect bad data if we
> don't
> > wait for the MUX config to land properly. Something I don't really know
> is
> > how come we're seemingly lucky to have the reports be cleanly invalid
> with
> > a zero report-id, instead of just having junk data that would be harder
> to
> > recognise.
>
> Yeah this mdelay story sounds realy scary. Few random comments:
> - msleep instead of mdelay please
>

Yup, this was a mistake I'd forgotten to fix in this series, but it's fixed in
the last series I sent, after Chris noticed it too.

Actually, in my latest version (from yesterday, after experimenting further
with the delay requirements) I'm using usleep_range for a delay of between 15
and 20 milliseconds, which seems to be enough.


> - no dmesg noise above debug level for stuff that we know can happen -
>   dmesg noise counts as igt failures
>

Okay. I don't think I have anything above debug level, unless things are
going badly wrong.

Double checking, though, has made me think twice about a WARN_ON in
gen7_oa_read for !oa_buffer_addr. That would be a bad situation, but the
check should either be removed (never expected), be a BUG_ON (since the code
would deref NULL anyway) or more gracefully return an error if seen.

I currently have some DRM_DEBUG_DRIVER messages for cases where userspace
messes up the properties it gives to open a stream - hopefully that sounds
ok? I've found it quite helpful to have a readable error for otherwise
vague EINVAL type errors.

I have a debug message I print if we see an invalid HW report, which
*shouldn't* happen but can (e.g. if the MUX delay or tail margin
aren't well tuned). It's helpful to have the feedback in case we end up
seeing this kind of message too frequently, which might indicate an issue
to investigate.
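
For illustration, the report-id sanity check described above could be sketched roughly as follows. This is a userspace sketch, not the driver code: the names sketch_copy_valid_reports and SKETCH_REPORT_SIZE are hypothetical, the 256 byte report size is an assumption, and treating the first 32-bit word of each report as the report-id is also an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed report size for this sketch; the real size depends on the OA format. */
#define SKETCH_REPORT_SIZE 256

/*
 * Copy reports from src to dst, skipping any whose first 32-bit word
 * (assumed here to be the report-id) is zero. Returns the number of
 * reports copied and stores the number skipped in *skipped.
 */
static size_t sketch_copy_valid_reports(const uint8_t *src, size_t n_reports,
					uint8_t *dst, size_t *skipped)
{
	size_t copied = 0;

	assert(src && dst && skipped);

	*skipped = 0;
	for (size_t i = 0; i < n_reports; i++) {
		const uint8_t *report = src + i * SKETCH_REPORT_SIZE;
		uint32_t report_id;

		memcpy(&report_id, report, sizeof(report_id));
		if (report_id == 0) {
			/* A driver would log this at debug level only. */
			(*skipped)++;
			continue;
		}
		memcpy(dst + copied * SKETCH_REPORT_SIZE, report,
		       SKETCH_REPORT_SIZE);
		copied++;
	}
	return copied;
}
```

Counting the skipped reports separately keeps the feedback available (for a debug message or a statistic) without letting invalid data reach the caller.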


> - reading 0 sounds more like bad synchronization.


I suppose I haven't thoroughly considered whether we should return zero in
any case - normally that would imply EOF, so we get to choose what it implies
here. I don't think the driver should ever return 0 from read() currently.

A few conscious choices re: read() return values have been:

- Never return partial records. (Or rather, even if part of a record were
literally copied into the userspace buffer before an error was hit in the
middle of copying a full sample, that record wouldn't be accounted for in
the byte count returned.)

- Don't throw away records successfully copied due to a later error. This
simplifies the internal error handling paths and the reporting of
EAGAIN/ENOSPC/EFAULT errors, and means data isn't lost. The precedence for
what we return is: 1) did we successfully copy some reports? Return the
bytes copied for complete records. 2) did we encounter an error? Return
that if so. 3) return -EAGAIN (though for a blocking fd this will be handled
internally).
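
The return-value precedence above can be sketched as a small userspace function. This is only an illustration under assumptions: sketch_read, sketch_copy_fn and sketch_fail_at are hypothetical names, and the per-record callback stands in for whatever the driver does to copy one record to userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical per-record copy callback: returns 0 on success or a
 * negative errno (for example -ENOSPC or -EFAULT).
 */
typedef int (*sketch_copy_fn)(void *ctx, size_t record_index);

/* Example callback for the sketch: fails with -EFAULT at index *ctx. */
static int sketch_fail_at(void *ctx, size_t record_index)
{
	const size_t *fail_index = ctx;

	return record_index == *fail_index ? -EFAULT : 0;
}

static ssize_t sketch_read(sketch_copy_fn copy_one, void *ctx,
			   size_t n_available, size_t record_size)
{
	size_t bytes = 0;
	int err = 0;

	assert(copy_one != NULL);

	for (size_t i = 0; i < n_available; i++) {
		err = copy_one(ctx, i);
		if (err)
			break;	/* a partially copied record is never counted */
		bytes += record_size;
	}

	if (bytes)
		return (ssize_t)bytes;	/* 1) complete records take priority */
	if (err)
		return err;		/* 2) otherwise report the error */
	return -EAGAIN;			/* 3) nothing to read yet */
}
```

With this ordering an error that interrupts a read is not lost: the records already copied are returned first, and the same error should show up again on the next call, when no records precede it.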



> Have you tried quiescing
>   the entire gpu (to make sure nothing is happen) and disabling OA, then
>   updating, and then restarting? At least on a very quick look I didn't
>   spot that. Random delays freak me out a bit, but wouldn't be surprised
>   if really needed.
>

Experimenting yesterday, it seems I can at least reduce the delay to around
15ms (granted that's still pretty huge), and it's also workable to have
userspace sleep for this time (despite the mdelay I originally went with).

I haven't tried this, but yeah, it could be interesting to see whether the MUX
config lands faster in different situations, such as when the HW is idle.
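
As a rough illustration of the "userspace sleeps instead" option, a caller could wait out the MUX settle time itself before enabling sampling. The helper name sketch_sleep_ms and the constant SKETCH_MUX_SETTLE_MS are hypothetical; the 15ms figure is just the empirical value mentioned above, not a documented hardware guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Empirical settle time from the experiments above; an assumption, not a spec. */
#define SKETCH_MUX_SETTLE_MS 15

/* Sleep for at least ms milliseconds, resuming if interrupted by a signal. */
static int sketch_sleep_ms(long ms)
{
	struct timespec req = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	assert(ms >= 0);

	/* On EINTR nanosleep() stores the remaining time back into req. */
	while (nanosleep(&req, &req) == -1) {
		if (errno != EINTR)
			return -errno;
	}
	return 0;
}
```

A caller would invoke sketch_sleep_ms(SKETCH_MUX_SETTLE_MS) after the configuration step and before enabling periodic sampling. Unlike mdelay in the kernel, this yields the CPU while waiting.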

Thanks,
- Robert


>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

[-- Attachment #1.2: Type: text/html, Size: 14513 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04 13:24               ` Robert Bragg
@ 2016-05-04 13:33                 ` Robert Bragg
  2016-05-04 13:51                 ` Daniel Vetter
  1 sibling, 0 replies; 44+ messages in thread
From: Robert Bragg @ 2016-05-04 13:33 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: intel-gfx, ML dri-devel, Sourab Gupta, Deepak S, Daniel Vetter


[-- Attachment #1.1: Type: text/plain, Size: 11070 bytes --]

On Wed, May 4, 2016 at 2:24 PM, Robert Bragg <robert@sixbynine.org> wrote:

>
>
> On Wed, May 4, 2016 at 1:24 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>
>> On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
>> > On Wed, May 4, 2016 at 10:09 AM, Martin Peres <
>> martin.peres@linux.intel.com>
>> > wrote:
>> >
>> > > On 03/05/16 23:03, Robert Bragg wrote:
>> > >
>> > >>
>> > >>
>> > >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
>> > >> <mailto:robert@sixbynine.org>> wrote:
>> > >>
>> > >>     Sorry for the delay replying to this, I missed it.
>> > >>
>> > >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <
>> martin.peres@free.fr
>> > >>     <mailto:martin.peres@free.fr>> wrote:
>> > >>
>> > >>         On 20/04/16 17:23, Robert Bragg wrote:
>> > >>
>> > >>             Gen graphics hardware can be set up to periodically write
>> > >>             snapshots of
>> > >>             performance counters into a circular buffer via its
>> > >> Observation
>> > >>             Architecture and this patch exposes that capability to
>> > >>             userspace via the
>> > >>             i915 perf interface.
>> > >>
>> > >>             Cc: Chris Wilson <chris@chris-wilson.co.uk
>> > >>             <mailto:chris@chris-wilson.co.uk>>
>> > >>             Signed-off-by: Robert Bragg <robert@sixbynine.org
>> > >>             <mailto:robert@sixbynine.org>>
>> > >>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
>> > >>             <mailto:zhenyuw@linux.intel.com>>
>> > >>
>> > >>             ---
>> > >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>> > >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>> > >>               drivers/gpu/drm/i915/i915_perf.c        | 940
>> > >>             +++++++++++++++++++++++++++++++-
>> > >>               drivers/gpu/drm/i915/i915_reg.h         | 338
>> ++++++++++++
>> > >>               include/uapi/drm/i915_drm.h             |  70 ++-
>> > >>               5 files changed, 1408 insertions(+), 20 deletions(-)
>> > >>
>> > >>             +
>> > >>             +
>> > >>             +       /* It takes a fairly long time for a new MUX
>> > >>             configuration to
>> > >>             +        * be be applied after these register writes.
>> This
>> > >> delay
>> > >>             +        * duration was derived empirically based on the
>> > >>             render_basic
>> > >>             +        * config but hopefully it covers the maximum
>> > >>             configuration
>> > >>             +        * latency...
>> > >>             +        */
>> > >>             +       mdelay(100);
>> > >>
>> > >>
>> > >>         With such a HW and SW design, how can we ever expose hope to
>> get
>> > >> any
>> > >>         kind of performance when we are trying to monitor different
>> > >>         metrics on each
>> > >>         draw call? This may be acceptable for system monitoring, but
>> it
>> > >>         is problematic
>> > >>         for the GL extensions :s
>> > >>
>> > >>
>> > >>         Since it seems like we are going for a perf API, it means
>> that
>> > >>         for every change
>> > >>         of metrics, we need to flush the commands, wait for the GPU
>> to
>> > >>         be done, then
>> > >>         program the new set of metrics via an IOCTL, wait 100 ms, and
>> > >>         then we may
>> > >>         resume rendering ... until the next change. We are talking
>> about
>> > >>         a latency of
>> > >>         6-7 frames at 60 Hz here... this is non-negligeable...
>> > >>
>> > >>
>> > >>         I understand that we have a ton of counters and we may hide
>> > >>         latency by not
>> > >>         allowing using more than half of the counters for every draw
>> > >>         call or frame, but
>> > >>         even then, this 100ms delay is killing this approach
>> altogether.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> So revisiting this to double check how things fail with my latest
>> > >> driver/tests without the delay, I apparently can't reproduce test
>> > >> failures without the delay any more...
>> > >>
>> > >> I think the explanation is that since first adding the delay to the
>> > >> driver I also made the the driver a bit more careful to not forward
>> > >> spurious reports that look invalid due to a zeroed report id field,
>> and
>> > >> that mechanism keeps the unit tests happy, even though there are
>> still
>> > >> some number of invalid reports generated if we don't wait.
>> > >>
>> > >> One problem with simply having no delay is that the driver prints an
>> > >> error if it sees an invalid reports so I get a lot of 'Skipping
>> > >> spurious, invalid OA report' dmesg spam. Also this was intended more
>> as
>> > >> a last resort mechanism, and I wouldn't feel too happy about
>> squashing
>> > >> the error message and potentially sweeping other error cases under
>> the
>> > >> carpet.
>> > >>
>> > >> Experimenting to see if the delay can at least be reduced, I brought
>> the
>> > >> delay up in millisecond increments and found that although I still
>> see a
>> > >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
>> > >> milliseconds its reduced quite a bit and at 15 milliseconds I don't
>> seem
>> > >> to have any errors.
>> > >>
>> > >> 15 milliseconds is still a long time, but at least not as long as
>> 100.
>> > >>
>> > >
>> > > OK, so the issue does not come from the HW after all, great!
>> > >
>> >
>> > Erm, I'm not sure that's a conclusion we can make here...
>> >
>> > The upshot here was really just reducing the delay from 100ms to 15ms.
>> > Previously I arrived at a workable delay by jumping the delay in orders
>> of
>> > magnitude with 10ms failing, 100ms passing and I didn't try and refine
>> it
>> > further. Here I've looked at delays between 10 and 100ms.
>> >
>> > The other thing is observing that because the kernel is checking for
>> > invalid reports (generated by the hardware) before forwarding to
>> userspace
>> > the lack of a delay no longer triggers i-g-t failures because the
>> invalid
>> > data won't reach i-g-t any more - though the invalid reports are still a
>> > thing to avoid.
>> >
>> >
>> > > Now, my main question is, why are spurious events generated when
>> changing
>> > > the MUX's value? I can understand that we would need to ignore the
>> reading
>> > > that came right after the change, but other than this,  I am a bit at
>> a
>> > > loss.
>> > >
>> >
>> > The MUX selects 16 signals that the OA unit can turn into 16 separate
>> > counters by basically counting the signal changes. (there's some fancy
>> > fixed function logic that can affect this but that's the general idea).
>> >
>> > If the MUX is in the middle of being re-programmed then some subset of
>> > those 16 signals are for who knows what.
>> >
>> > After programming the MUX we will go on to configure the OA unit and the
>> > tests will enable periodic sampling which (if we have no delay) will
>> sample
>> > the OA counters that are currently being fed by undefined signals.
>> >
>> > So as far as that goes it makes sense to me to expect bad data if we
>> don't
>> > wait for the MUX config to land properly. Something I don't really know
>> is
>> > how come we're seemingly lucky to have the reports be cleanly invalid
>> with
>> > a zero report-id, instead of just having junk data that would be harder
>> to
>> > recognise.
>>
>> Yeah this mdelay story sounds realy scary. Few random comments:
>> - msleep instead of mdelay please
>>
>
> yup this was a mistake I'd forgotten to fix in this series, but is fixed
> in the last series I sent after chris noticed too.
>
> actually in my latest (yesterday after experimenting further with the
> delay requirements) I'm using usleep_range for a delay between 15 and 20
> milliseconds which seems to be enough.
>
>
>> - no dmesg noise above debug level for stuff that we know can happen -
>>   dmesg noise counts as igt failures
>>
>
> okey. I don't think I have anything above debug level, unless things are
> going badly wrong.
>
> Just double checking though has made me think twice about a WARN_ON in
> gen7_oa_read for !oa_buffer_addr, which would be a bad situation but should
> either be removed (never expected), be a BUG_ON (since the code would deref
> NULL anyway) or more gracefully return an error if seen.
>
> I currently have some DRM_DRIVER_DEBUG errors for cases where userspace
> messes up what properties it gives to open a stream - hopefully that sound
> ok? I've found it quite helpful to have a readable error for otherwise
> vague EINVAL type errors.
>
> I have a debug message I print if we see an invalid HW report, which
> *shouldn't* happen but can happen (e.g. if the MUX delay or tail margin
> aren't well tuned) and it's helpful to have the feedback, in case we end up
> in a situation where we see this kind of message too frequently which might
> indicate an issue to investigate.
>
>
>> - reading 0 sounds more like bad synchronization.
>
>
> I suppose I haven't thoroughly considered if we should return zero in any
> case  - normally that would imply EOF so we get to choose what that implies
> here. I don't think the driver should ever return 0 from read() currently.
>
> A few concious choices re: read() return values have been:
>
> - never ever return partial records (or rather even if a partial record
> were literally copied into the userspace buffer, but an error were hit in
> the middle of copying a full sample then that record wouldn't be accounted
> for in the byte count returned.)
>
> - Don't throw away records successfully copied, due to a later error. This
> simplifies error handling paths internally and reporting
> EAGAIN/ENOSPC/EFAULT errors and means data isn't lost. The precedence for
> what we return is 1) did we successfully copy some reports? report bytes
> copied for complete records. 2) did we encounter an error? report that if
> so. 3) return -EAGAIN. (though for a blocking fd this will be handled
> internally).
>
>
>
>> Have you tried quiescing
>
> the entire gpu (to make sure nothing is happen) and disabling OA, then
>>   updating, and then restarting? At least on a very quick look I didn't
>>   spot that. Random delays freak me out a bit, but wouldn't be surprised
>>   if really needed.
>>
>
> Experimenting yesterday, it seems I can at least reduce the delay to
> around 15ms (granted that's still pretty huge), and it's also workable to
> have userspace sleep for this time (despite the mdelay I originally went
> with)
>
> Haven't tried this, but yeah could be interesting to experiment if the MUX
> config lands faster in different situation such as when the HW is idle.
>

Hmm, maybe a stretch, but 15ms is perhaps coincidentally close to the
vblank period, and the MUX relates to a fabric across the whole gpu... it's
not totally implausible that there could be an interaction there too. Another
one to experiment with.

- Robert


>
> Thanks,
> - Robert
>
>
>>
>> Cheers, Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>>
>
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit
  2016-05-04 13:24               ` Robert Bragg
  2016-05-04 13:33                 ` Robert Bragg
@ 2016-05-04 13:51                 ` Daniel Vetter
  1 sibling, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2016-05-04 13:51 UTC (permalink / raw)
  To: Robert Bragg
  Cc: intel-gfx, ML dri-devel, Sourab Gupta, Deepak S, Daniel Vetter

On Wed, May 04, 2016 at 02:24:14PM +0100, Robert Bragg wrote:
> On Wed, May 4, 2016 at 1:24 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> > > On Wed, May 4, 2016 at 10:09 AM, Martin Peres <
> > martin.peres@linux.intel.com>
> > > wrote:
> > >
> > > > On 03/05/16 23:03, Robert Bragg wrote:
> > > >
> > > >>
> > > >>
> > > >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@sixbynine.org
> > > >> <mailto:robert@sixbynine.org>> wrote:
> > > >>
> > > >>     Sorry for the delay replying to this, I missed it.
> > > >>
> > > >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <
> > martin.peres@free.fr
> > > >>     <mailto:martin.peres@free.fr>> wrote:
> > > >>
> > > >>         On 20/04/16 17:23, Robert Bragg wrote:
> > > >>
> > > >>             Gen graphics hardware can be set up to periodically write
> > > >>             snapshots of
> > > >>             performance counters into a circular buffer via its
> > > >> Observation
> > > >>             Architecture and this patch exposes that capability to
> > > >>             userspace via the
> > > >>             i915 perf interface.
> > > >>
> > > >>             Cc: Chris Wilson <chris@chris-wilson.co.uk
> > > >>             <mailto:chris@chris-wilson.co.uk>>
> > > >>             Signed-off-by: Robert Bragg <robert@sixbynine.org
> > > >>             <mailto:robert@sixbynine.org>>
> > > >>             Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com
> > > >>             <mailto:zhenyuw@linux.intel.com>>
> > > >>
> > > >>             ---
> > > >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> > > >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> > > >>               drivers/gpu/drm/i915/i915_perf.c        | 940
> > > >>             +++++++++++++++++++++++++++++++-
> > > >>               drivers/gpu/drm/i915/i915_reg.h         | 338
> > ++++++++++++
> > > >>               include/uapi/drm/i915_drm.h             |  70 ++-
> > > >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> > > >>
> > > >>             +
> > > >>             +
> > > >>             +       /* It takes a fairly long time for a new MUX
> > > >>             configuration to
> > > >>             +        * be be applied after these register writes. This
> > > >> delay
> > > >>             +        * duration was derived empirically based on the
> > > >>             render_basic
> > > >>             +        * config but hopefully it covers the maximum
> > > >>             configuration
> > > >>             +        * latency...
> > > >>             +        */
> > > >>             +       mdelay(100);
> > > >>
> > > >>
> > > >>         With such a HW and SW design, how can we ever expose hope to
> > get
> > > >> any
> > > >>         kind of performance when we are trying to monitor different
> > > >>         metrics on each
> > > >>         draw call? This may be acceptable for system monitoring, but
> > it
> > > >>         is problematic
> > > >>         for the GL extensions :s
> > > >>
> > > >>
> > > >>         Since it seems like we are going for a perf API, it means that
> > > >>         for every change
> > > >>         of metrics, we need to flush the commands, wait for the GPU to
> > > >>         be done, then
> > > >>         program the new set of metrics via an IOCTL, wait 100 ms, and
> > > >>         then we may
> > > >>         resume rendering ... until the next change. We are talking
> > about
> > > >>         a latency of
> > > >>         6-7 frames at 60 Hz here... this is non-negligeable...
> > > >>
> > > >>
> > > >>         I understand that we have a ton of counters and we may hide
> > > >>         latency by not
> > > >>         allowing using more than half of the counters for every draw
> > > >>         call or frame, but
> > > >>         even then, this 100ms delay is killing this approach
> > altogether.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> So revisiting this to double check how things fail with my latest
> > > >> driver/tests without the delay, I apparently can't reproduce test
> > > >> failures without the delay any more...
> > > >>
> > > >> I think the explanation is that since first adding the delay to the
> > > >> driver I also made the the driver a bit more careful to not forward
> > > >> spurious reports that look invalid due to a zeroed report id field,
> > and
> > > >> that mechanism keeps the unit tests happy, even though there are still
> > > >> some number of invalid reports generated if we don't wait.
> > > >>
> > > >> One problem with simply having no delay is that the driver prints an
> > > >> error if it sees an invalid reports so I get a lot of 'Skipping
> > > >> spurious, invalid OA report' dmesg spam. Also this was intended more
> > as
> > > >> a last resort mechanism, and I wouldn't feel too happy about squashing
> > > >> the error message and potentially sweeping other error cases under the
> > > >> carpet.
> > > >>
> > > >> Experimenting to see if the delay can at least be reduced, I brought
> > the
> > > >> delay up in millisecond increments and found that although I still
> > see a
> > > >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> > > >> milliseconds its reduced quite a bit and at 15 milliseconds I don't
> > seem
> > > >> to have any errors.
> > > >>
> > > >> 15 milliseconds is still a long time, but at least not as long as 100.
> > > >>
> > > >
> > > > OK, so the issue does not come from the HW after all, great!
> > > >
> > >
> > > Erm, I'm not sure that's a conclusion we can make here...
> > >
> > > The upshot here was really just reducing the delay from 100ms to 15ms.
> > > Previously I arrived at a workable delay by jumping the delay in orders
> > of
> > > magnitude with 10ms failing, 100ms passing and I didn't try and refine it
> > > further. Here I've looked at delays between 10 and 100ms.
> > >
> > > The other thing is observing that because the kernel is checking for
> > > invalid reports (generated by the hardware) before forwarding to
> > userspace
> > > the lack of a delay no longer triggers i-g-t failures because the invalid
> > > data won't reach i-g-t any more - though the invalid reports are still a
> > > thing to avoid.
> > >
> > >
> > > > Now, my main question is, why are spurious events generated when
> > changing
> > > > the MUX's value? I can understand that we would need to ignore the
> > reading
> > > > that came right after the change, but other than this,  I am a bit at a
> > > > loss.
> > > >
> > >
> > > The MUX selects 16 signals that the OA unit can turn into 16 separate
> > > counters by basically counting the signal changes. (there's some fancy
> > > fixed function logic that can affect this but that's the general idea).
> > >
> > > If the MUX is in the middle of being re-programmed then some subset of
> > > those 16 signals are for who knows what.
> > >
> > > After programming the MUX we will go on to configure the OA unit and the
> > > tests will enable periodic sampling which (if we have no delay) will
> > sample
> > > the OA counters that are currently being fed by undefined signals.
> > >
> > > So as far as that goes it makes sense to me to expect bad data if we
> > don't
> > > wait for the MUX config to land properly. Something I don't really know
> > is
> > > how come we're seemingly lucky to have the reports be cleanly invalid
> > with
> > > a zero report-id, instead of just having junk data that would be harder
> > to
> > > recognise.
> >
> > Yeah this mdelay story sounds realy scary. Few random comments:
> > - msleep instead of mdelay please
> >
> 
> yup this was a mistake I'd forgotten to fix in this series, but is fixed in
> the last series I sent after chris noticed too.
> 
> actually in my latest (yesterday after experimenting further with the delay
> requirements) I'm using usleep_range for a delay between 15 and 20
> milliseconds which seems to be enough.
> 
> 
> > - no dmesg noise above debug level for stuff that we know can happen -
> >   dmesg noise counts as igt failures
> >
> 
> Okay. I don't think I have anything above debug level, unless things are
> going badly wrong.
> 
> Just double checking though has made me think twice about a WARN_ON in
> gen7_oa_read for !oa_buffer_addr, which would be a bad situation but should
> either be removed (never expected), be a BUG_ON (since the code would deref
> NULL anyway) or more gracefully return an error if seen.

WARN_ON + bail out, or BUG_ON. Silently fixing up without failing loud in
dmesg is imo the wrong approach for something that should never happen.
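
A minimal userspace sketch of the "WARN_ON + bail out" pattern described above. The `warn_on()` helper and `WARN_ON` macro here are stand-ins written for this example (the kernel's real WARN_ON also evaluates to the condition, which is what makes the `if (WARN_ON(...))` idiom work), and `oa_read_sketch()` is a hypothetical reader, not the driver's gen7_oa_read:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's WARN_ON(): complains loudly and
 * returns the truth value of the condition so it can gate a bail-out. */
static int warn_count;

static int warn_on(int cond, const char *expr)
{
	if (cond) {
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", expr);
	}
	return cond;
}
#define WARN_ON(x) warn_on(!!(x), #x)

/* Hypothetical reader: oa_buffer_addr is never expected to be NULL here. */
static int oa_read_sketch(const unsigned char *oa_buffer_addr, size_t len,
			  unsigned char *out)
{
	assert(out != NULL);

	/* Fail loudly, then return an error instead of dereferencing NULL. */
	if (WARN_ON(!oa_buffer_addr))
		return -EIO;

	for (size_t i = 0; i < len; i++)
		out[i] = oa_buffer_addr[i];

	return (int)len;
}
```

The choice of -EIO is an assumption made for the sketch; the thread does not say which error code would be returned.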

> I currently have some DRM_DEBUG_DRIVER errors for cases where userspace
> messes up what properties it gives to open a stream - hopefully that sounds
> ok? I've found it quite helpful to have a readable error for otherwise
> vague EINVAL type errors.

Yeah, as long as it's debug-only it's perfectly fine. We actually try to
have such a line for each EINVAL, since it's so useful (but then userspace
always hits the one case we've missed to document with debug output!).

> I have a debug message I print if we see an invalid HW report, which
> *shouldn't* happen but can happen (e.g. if the MUX delay or tail margin
> aren't well tuned) and it's helpful to have the feedback, in case we end up
> in a situation where we see this kind of message too frequently which might
> indicate an issue to investigate.

That's fine too.
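
For illustration, a small userspace sketch of the report-id sanity check being discussed: reports whose id is zero are skipped (with a debug message) rather than forwarded. The 64-byte REPORT_SIZE, the assumption that the report id is the first 32-bit word of each report, and the function name `copy_valid_reports()` are all choices made for this example, not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REPORT_SIZE 64 /* assumed report size in bytes, for illustration */

/* Copy reports from src to dst, skipping any whose first dword (assumed
 * to hold the report id) is zero. Returns the number of reports copied
 * and stores the number of skipped reports in *skipped. */
static size_t copy_valid_reports(const uint8_t *src, size_t n_reports,
				 uint8_t *dst, size_t *skipped)
{
	size_t copied = 0;

	assert(src && dst && skipped);
	*skipped = 0;

	for (size_t i = 0; i < n_reports; i++) {
		const uint8_t *report = src + i * REPORT_SIZE;
		uint32_t report_id;

		memcpy(&report_id, report, sizeof(report_id));
		if (report_id == 0) {
			/* Debug-level feedback only: this can happen. */
			(*skipped)++;
			fprintf(stderr, "debug: skipping invalid report %zu\n", i);
			continue;
		}

		memcpy(dst + copied * REPORT_SIZE, report, REPORT_SIZE);
		copied++;
	}

	return copied;
}
```

Counting the skipped reports gives the kind of feedback mentioned above: a count that grows quickly would suggest the MUX delay or tail margin needs retuning.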

> > - reading 0 sounds more like bad synchronization.
> 
> 
> I suppose I haven't thoroughly considered if we should return zero in any
> case  - normally that would imply EOF so we get to choose what that implies
> here. I don't think the driver should ever return 0 from read() currently.
> 
> A few conscious choices re: read() return values have been:
> 
> - never ever return partial records (or rather, even if part of a record
> was literally copied into the userspace buffer before an error was hit in
> the middle of copying a full sample, that record isn't accounted for in
> the byte count returned.)
> 
> - Don't throw away records successfully copied, due to a later error. This
> simplifies error handling paths internally and reporting
> EAGAIN/ENOSPC/EFAULT errors and means data isn't lost. The precedence for
> what we return is 1) did we successfully copy some reports? report bytes
> copied for complete records. 2) did we encounter an error? report that if
> so. 3) return -EAGAIN. (though for a blocking fd this will be handled
> internally).
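
The return-value precedence listed above can be sketched in userspace C roughly as follows. `read_result()` and `read_records()` are hypothetical names for this example; the loop only models record sizes against a buffer size and stands in for the real copy to userspace:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Precedence: 1) bytes copied for complete records, 2) otherwise the
 * error that was hit (a negative errno), 3) otherwise -EAGAIN. */
static long read_result(size_t bytes_copied, int err)
{
	assert(err <= 0);

	if (bytes_copied > 0)
		return (long)bytes_copied;
	if (err)
		return err;
	return -EAGAIN;
}

/* Hypothetical copy loop: only whole records are accounted for, and a
 * record that doesn't fit stops the loop with -ENOSPC. */
static long read_records(const size_t *record_sizes, size_t n_records,
			 size_t buf_size)
{
	size_t copied = 0;
	int err = 0;

	for (size_t i = 0; i < n_records; i++) {
		if (record_sizes[i] > buf_size - copied) {
			err = -ENOSPC;
			break;
		}
		copied += record_sizes[i];
	}

	return read_result(copied, err);
}
```

With three 64-byte records and a 100-byte buffer this returns 64: the first record is reported and the -ENOSPC from the second is dropped in favour of the data already copied.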
> 
> 
> 
> > Have you tried quiescing the entire gpu (to make sure nothing is
> >   happening) and disabling OA, then updating, and then restarting? At
> >   least on a very quick look I didn't spot that. Random delays freak me
> >   out a bit, but I wouldn't be surprised if they're really needed.
> >
> 
> Experimenting yesterday, it seems I can at least reduce the delay to around
> 15ms (granted that's still pretty huge), and it's also workable to have
> userspace sleep for this time (despite the mdelay I originally went with)
> 
> Haven't tried this, but yeah, it could be interesting to experiment to see
> if the MUX config lands faster in different situations, such as when the HW
> is idle.

In either case I think it'd be good to excessively document what you've
discovered. Maybe even split out the msleep into a separate patch, so that
the commit message with all the details is easy to find again in the
future. Because 2 months down the road someone will read this and go wtf
;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Thread overview: 44+ messages
2016-04-20 14:23 [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
2016-04-20 14:23 ` [PATCH 1/9] drm/i915: Add i915 perf infrastructure Robert Bragg
2016-04-20 22:41   ` Chris Wilson
2016-04-20 14:23 ` [PATCH 2/9] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
2016-04-20 14:23 ` [PATCH 3/9] drm/i915: don't whitelist oacontrol in cmd parser Robert Bragg
2016-04-20 14:23 ` [PATCH 4/9] drm/i915: Add 'render basic' Haswell OA unit config Robert Bragg
2016-04-20 14:23 ` [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit Robert Bragg
2016-04-20 16:16   ` kbuild test robot
2016-04-20 20:30   ` [Intel-gfx] " kbuild test robot
2016-04-20 21:11   ` Chris Wilson
2016-04-21 16:15     ` Robert Bragg
2016-04-23  8:48       ` Chris Wilson
2016-04-20 22:15   ` Chris Wilson
2016-04-20 22:46   ` Chris Wilson
2016-04-22 11:04     ` Robert Bragg
2016-04-22 11:18       ` Chris Wilson
2016-04-20 22:52   ` Chris Wilson
2016-04-21 15:43     ` Robert Bragg
2016-04-21 16:21       ` Chris Wilson
2016-04-20 23:09   ` Chris Wilson
2016-04-21 15:18     ` Robert Bragg
2016-04-22  1:10       ` Robert Bragg
2016-04-20 23:16   ` Chris Wilson
2016-04-21 15:01     ` Robert Bragg
2016-04-23 10:34   ` Martin Peres
2016-05-03 19:34     ` Robert Bragg
2016-05-03 20:03       ` Robert Bragg
2016-05-04  9:09         ` Martin Peres
2016-05-04  9:49           ` Robert Bragg
2016-05-04 12:24             ` Daniel Vetter
2016-05-04 13:24               ` Robert Bragg
2016-05-04 13:33                 ` Robert Bragg
2016-05-04 13:51                 ` Daniel Vetter
2016-05-04  9:04       ` [Intel-gfx] " Martin Peres
2016-05-04 11:15         ` Robert Bragg
2016-04-20 14:23 ` [PATCH 6/9] drm/i915: advertise available metrics via sysfs Robert Bragg
2016-04-20 14:23 ` [PATCH 7/9] drm/i915: Add dev.i915.perf_event_paranoid sysctl option Robert Bragg
2016-04-20 14:23 ` [PATCH 8/9] drm/i915: add oa_event_min_timer_exponent sysctl Robert Bragg
2016-04-20 14:23 ` [PATCH 9/9] drm/i915: Add more Haswell OA metric sets Robert Bragg
2016-04-20 14:56 ` [PATCH 0/9] Enable Gen 7 Observation Architecture Robert Bragg
2016-04-21  7:46 ` ✓ Fi.CI.BAT: success for Enable Gen 7 Observation Architecture (rev3) Patchwork
2016-04-21 12:41 ` ✗ Fi.CI.BAT: failure " Patchwork
2016-04-23  8:31 ` ✗ Fi.CI.BAT: warning " Patchwork
2016-04-24 17:23 ` ✓ Fi.CI.BAT: success " Patchwork
