* [RFC 0/8] Introduce framework to forward asynchronous OA counter
@ 2015-06-22  9:50 sourab.gupta
  2015-06-22  9:50 ` [RFC 1/8] drm/i915: Have globally unique context ids, as opposed to drm file specific sourab.gupta
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

Cc: Robert Bragg <robert@sixbynine.org>,
    Zhenyu Wang <zhenyuw@linux.intel.com>,
    Jon Bloomfield <jon.bloomfield@intel.com>,
    Peter Zijlstra <a.p.zijlstra@chello.nl>,
    Jabin Wu <jabin.wu@intel.com>,
    Insoo Woo <insoo.woo@intel.com>

This patch series adds support for capturing OA counter snapshots at
asynchronous points by inserting MI_REPORT_PERF_COUNT commands into the CS,
and forwarding these snapshots to userspace through the perf interface.
These commands can be inserted at asynchronous points during workload
execution, e.g. at batch buffer boundaries.

This work is based on Robert Bragg's perf event framework, the patches for
which were floated earlier. Please see the link below:

http://lists.freedesktop.org/archives/intel-gfx/2015-May/066102.html

The perf event framework enabled capture of periodic OA counter snapshots
by configuring the OA unit during perf event init. The raw OA reports generated
by HW are then forwarded to userspace using the perf APIs.

There may be usecases which need more than the periodic OA capture
functionality currently supported by perf_event. A few such usecases are:
    - Ability to capture system wide metrics. The reports captured should be
      mappable back to individual contexts.
    - Ability to inject tags for work into the reports. This provides
      visibility into the multiple stages of work within a single context.

This framework may also be seen as a way to overcome a limitation of Haswell,
which doesn't write out a context ID with OA reports. Handling this in the
kernel also keeps us compatible with Broadwell, which does include a context id
in its reports.

This can be achieved by inserting commands into the ring to dump the OA
counter snapshots at asynchronous points during workload execution.
The reports generated can have an additional footer appended to capture
metadata information such as ctx id, pid, tags, etc.
The specific issue of counter wraparound due to large batchbuffers can be
mitigated by using these reports in conjunction with periodic OA snapshots. Such
per-BB data can give userspace tools useful information for analyzing performance
and timing at the batchbuffer level.

An application intending to profile its own contexts can do so by submitting
the MI_REPORT_PERF_COUNT commands into the CS from userspace itself.
But consider the usecase of a system wide GPU profiler tool which needs the
data for all the workloads being scheduled on the GPU globally. The relative
complexity of doing this in the kernel is significantly less than supporting
such a usecase through userspace.

This framework is intended to feed into the requirement of such system wide
GPU profilers, which may further use this data for usecases such as
performance analysis (at a global level), identifying optimization scenarios
for improving GPU utilization, CPU vs GPU timing analysis, etc. Again, this is
made possible by the metadata information accompanying individual reports, which
is enabled by this framework.

One such system wide GPU profiler is the MVP (Modular Video Profiler) tool,
used by the media team for profiling media workloads. (Talks are in progress
for open sourcing this tool.)

The current implementation approach is to forward these samples through the
same PERF_SAMPLE_RAW sample type used for periodic samples, with an
additional footer appended for metadata information. Userspace can then
distinguish these samples by filtering on the basis of sample size.
One of the other approaches being contemplated right now is creating separate
sample types to handle these different kinds of samples. There would be different
fds associated with these different sample types, though they could be part of
one event group. Userspace could listen to either or both of these sample types
by specifying event attributes during event init.
But right now, I'm seeing this work as a future refinement, based on acceptance
of the general framework as such. As of now, I'm looking to get feedback on
these initial patches, w.r.t. the usage of the perf APIs and the interaction with
i915.
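
As a minimal sketch of the size-based filtering described above, seen from the
consumer side (the footer mirrors drm_i915_oa_async_node_footer from the uapi
changes later in the series; the helper and names are illustrative assumptions,
not the final ABI):

    #include <stdint.h>

    /* Footer appended to async snapshots (mirrors drm_i915_oa_async_node_footer) */
    struct oa_sample_footer {
            uint32_t pid;
            uint32_t ctx_id;
    };

    /* raw/raw_size: PERF_SAMPLE_RAW payload of one sample,
     * report_size: OA report size implied by the format chosen at event init */
    static void handle_raw_sample(const uint8_t *raw, uint32_t raw_size,
                                  uint32_t report_size)
    {
            if (raw_size >= report_size + sizeof(struct oa_sample_footer)) {
                    const struct oa_sample_footer *f =
                            (const void *)(raw + report_size);
                    /* async snapshot: attribute the report to f->ctx_id / f->pid */
                    (void)f;
            } else {
                    /* periodic snapshot: payload is the OA report alone */
            }
    }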

Another feature introduced in these patches is perfTag. PerfTag is a mechanism
whereby the reports collected are marked with a perfTag passed by userspace
during the execbuffer call. This way the userspace tool can associate the
reports collected with the corresponding execbuffers. This satisfies the
requirement to have visibility into the multiple stages (i.e. execbuffers)
lying within a single context.
For example, in the media pipeline, the CodecHAL encoding stage has a single
context, and involves multiple stages such as Scaling, ME, MBEnc and PAK, for
which there are separate execbuffer calls. The reports generated need to have
the granularity of these multiple stages of a context, and the presence of a
perftag in the report metadata fulfills this requirement. This is done right
now by using the rsvd2 field of the execbuffer ioctl structure, and introducing
an additional bitfield in flags to inform the KMD of the same.
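
A hedged sketch of how a profiler might pass such a tag from userspace; the
flag name/value I915_EXEC_PERFTAG below is an illustrative assumption (only
the use of rsvd2 plus a flags bit comes from this series):

    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <drm/i915_drm.h>

    #define I915_EXEC_PERFTAG (1 << 16) /* hypothetical bit, not the real uapi name/value */

    static int submit_tagged(int fd, struct drm_i915_gem_exec_object2 *objs,
                             uint32_t nobjs, uint32_t ctx_id, uint32_t perftag)
    {
            struct drm_i915_gem_execbuffer2 eb;

            memset(&eb, 0, sizeof(eb));
            eb.buffers_ptr = (uintptr_t)objs;
            eb.buffer_count = nobjs;
            eb.flags = I915_EXEC_RENDER | I915_EXEC_PERFTAG;
            i915_execbuffer2_set_context_id(eb, ctx_id); /* context handle in rsvd1 */
            eb.rsvd2 = perftag; /* stage tag chosen by the tool, e.g. one per ME/MBEnc/PAK pass */

            return drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);
    }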


One of the prerequisites for this work is the presence of a globally unique
context id. The context id right now is specific to a drm file instance. As
such, it can't be used to uniquely associate the reports generated with the
corresponding context scheduled from userspace in a global way. In the absence
of a globally unique context id, other metadata such as pid/tags in conjunction
with the ctx id may be used to associate reports with their corresponding
contexts.

The first patch in the series proposes a way of implementing globally unique
context ids. I'm looking for comments on the pros & cons of having a global ctx
id. This implementation can be refined if the approach is acceptable.
The subsequent patches introduce the asynchronous OA capture mode and the
mechanism to forward these snapshots using perf.

This patch set currently supports Haswell. Gen8+ support can be added when
the basic framework is agreed upon.

Sourab Gupta (8):
  drm/i915: Have globally unique context ids, as opposed to drm file
    specific
  drm/i915: Introduce mode for asynchronous capture of OA counters
  drm/i915: Add the data structures for async OA capture mode
  drm/i915: Add mechanism for forwarding async OA counter snapshots
    through perf
  drm/i915: Wait for GPU to finish before event stop, in async OA
    counter mode
  drm/i915: Routines for inserting OA capture commands in the ringbuffer
  drm/i915: Add commands in ringbuf for OA snapshot capture across
    Batchbuffer boundaries
  drm/i915: Add perfTag support for OA counter reports

 drivers/gpu/drm/i915/i915_debugfs.c        |   2 +-
 drivers/gpu/drm/i915/i915_dma.c            |   1 +
 drivers/gpu/drm/i915/i915_drv.h            |  47 ++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  53 +++-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   9 +
 drivers/gpu/drm/i915/i915_oa_perf.c        | 451 +++++++++++++++++++++++++++--
 include/uapi/drm/i915_drm.h                |  24 +-
 7 files changed, 538 insertions(+), 49 deletions(-)

-- 
1.8.5.1


* [RFC 1/8] drm/i915: Have globally unique context ids, as opposed to drm file specific
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22  9:50 ` [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters sourab.gupta
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

Currently the context ids are specific to a drm file instance, as opposed to
being globally unique. There are some usecases which may require globally
unique context ids. For example, a system level GPU profiler tool may lean upon
the context ids to associate the performance snapshots with individual contexts.
If the context ids are unique, it can do so without relying on any additional
information such as the pid and drm fd.

This patch proposes an implementation of globally unique context ids, by
conceptually moving the idr table holding the context ids from the file private
structure into the device private structure. The case of the default context id
for a drm file (which is given by id=0) is handled by storing that id in the
file private during context creation, and retrieving it as and when required.

This patch is proposed as an enabler for the patches following in the series. In
particular, I'm looking for feedback on the pros and cons of having a globally
unique context id, and any specific inputs on this particular implementation.
This implementation can be improved upon, if agreed upon conceptually.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  2 +-
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++-
 drivers/gpu/drm/i915/i915_gem_context.c | 53 +++++++++++++++++++++++----------
 3 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 47636f3..74c736c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2258,8 +2258,8 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev)
 
 		seq_printf(m, "proc: %s\n",
 			   get_pid_task(file->pid, PIDTYPE_PID)->comm);
-		idr_for_each(&file_priv->context_idr, per_file_ctx, m);
 	}
+	idr_for_each(&dev_priv->context_idr, per_file_ctx, m);
 	seq_printf(m, "ECOCHK: 0x%08x\n", I915_READ(GAM_ECOCHK));
 }
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 50977f0..baa0234 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -320,7 +320,7 @@ struct drm_i915_file_private {
  */
 #define DRM_I915_THROTTLE_JIFFIES msecs_to_jiffies(20)
 	} mm;
-	struct idr context_idr;
+	u32 first_ctx_id;
 
 	struct intel_rps_client {
 		struct list_head link;
@@ -1754,6 +1754,8 @@ struct drm_i915_private {
 	struct intel_opregion opregion;
 	struct intel_vbt_data vbt;
 
+	struct idr context_idr;
+
 	bool preserve_bios_swizzle;
 
 	/* overlay */
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index d9ccad5..6b572c1 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -212,7 +212,8 @@ i915_gem_alloc_context_obj(struct drm_device *dev, size_t size)
 
 static struct intel_context *
 __create_hw_context(struct drm_device *dev,
-		    struct drm_i915_file_private *file_priv)
+		    struct drm_i915_file_private *file_priv,
+		    bool is_first_ctx)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_context *ctx;
@@ -237,10 +238,12 @@ __create_hw_context(struct drm_device *dev,
 
 	/* Default context will never have a file_priv */
 	if (file_priv != NULL) {
-		ret = idr_alloc(&file_priv->context_idr, ctx,
+		ret = idr_alloc(&dev_priv->context_idr, ctx,
 				DEFAULT_CONTEXT_HANDLE, 0, GFP_KERNEL);
 		if (ret < 0)
 			goto err_out;
+		if (is_first_ctx)
+			file_priv->first_ctx_id = ret;
 	} else
 		ret = DEFAULT_CONTEXT_HANDLE;
 
@@ -267,7 +270,8 @@ err_out:
  */
 static struct intel_context *
 i915_gem_create_context(struct drm_device *dev,
-			struct drm_i915_file_private *file_priv)
+			struct drm_i915_file_private *file_priv,
+			bool is_first_ctx)
 {
 	const bool is_global_default_ctx = file_priv == NULL;
 	struct intel_context *ctx;
@@ -275,7 +279,7 @@ i915_gem_create_context(struct drm_device *dev,
 
 	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
 
-	ctx = __create_hw_context(dev, file_priv);
+	ctx = __create_hw_context(dev, file_priv, is_first_ctx);
 	if (IS_ERR(ctx))
 		return ctx;
 
@@ -348,6 +352,14 @@ void i915_gem_context_reset(struct drm_device *dev)
 	}
 }
 
+static int context_idr_cleanup(int id, void *p, void *data)
+{
+	struct intel_context *ctx = p;
+
+	i915_gem_context_unreference(ctx);
+	return 0;
+}
+
 int i915_gem_context_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -371,8 +383,9 @@ int i915_gem_context_init(struct drm_device *dev)
 			dev_priv->hw_context_size = 0;
 		}
 	}
+	idr_init(&dev_priv->context_idr);
 
-	ctx = i915_gem_create_context(dev, NULL);
+	ctx = i915_gem_create_context(dev, NULL, false);
 	if (IS_ERR(ctx)) {
 		DRM_ERROR("Failed to create default global context (error %ld)\n",
 			  PTR_ERR(ctx));
@@ -398,6 +411,9 @@ void i915_gem_context_fini(struct drm_device *dev)
 	struct intel_context *dctx = dev_priv->ring[RCS].default_context;
 	int i;
 
+	idr_for_each(&dev_priv->context_idr, context_idr_cleanup, NULL);
+	idr_destroy(&dev_priv->context_idr);
+
 	if (dctx->legacy_hw_ctx.rcs_state) {
 		/* The only known way to stop the gpu from accessing the hw context is
 		 * to reset it. Do this as the very last operation to avoid confusing
@@ -465,11 +481,14 @@ int i915_gem_context_enable(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-static int context_idr_cleanup(int id, void *p, void *data)
+static int cleanup_file_contexts(int id, void *p, void *data)
 {
 	struct intel_context *ctx = p;
+	struct drm_i915_file_private *file_priv = data;
+
+	if (ctx->file_priv == file_priv)
+		i915_gem_context_unreference(ctx);
 
-	i915_gem_context_unreference(ctx);
 	return 0;
 }
 
@@ -478,14 +497,11 @@ int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct intel_context *ctx;
 
-	idr_init(&file_priv->context_idr);
-
 	mutex_lock(&dev->struct_mutex);
-	ctx = i915_gem_create_context(dev, file_priv);
+	ctx = i915_gem_create_context(dev, file_priv, true);
 	mutex_unlock(&dev->struct_mutex);
 
 	if (IS_ERR(ctx)) {
-		idr_destroy(&file_priv->context_idr);
 		return PTR_ERR(ctx);
 	}
 
@@ -495,17 +511,21 @@ int i915_gem_context_open(struct drm_device *dev, struct drm_file *file)
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct drm_i915_private *dev_priv = file_priv->dev_priv;
 
-	idr_for_each(&file_priv->context_idr, context_idr_cleanup, NULL);
-	idr_destroy(&file_priv->context_idr);
+	idr_for_each(&dev_priv->context_idr, cleanup_file_contexts, file_priv);
 }
 
 struct intel_context *
 i915_gem_context_get(struct drm_i915_file_private *file_priv, u32 id)
 {
+	struct drm_i915_private *dev_priv = file_priv->dev_priv;
 	struct intel_context *ctx;
 
-	ctx = (struct intel_context *)idr_find(&file_priv->context_idr, id);
+	if (id == 0)
+		id = file_priv->first_ctx_id;
+
+	ctx = (struct intel_context *)idr_find(&dev_priv->context_idr, id);
 	if (!ctx)
 		return ERR_PTR(-ENOENT);
 
@@ -862,7 +882,7 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 	if (ret)
 		return ret;
 
-	ctx = i915_gem_create_context(dev, file_priv);
+	ctx = i915_gem_create_context(dev, file_priv, false);
 	mutex_unlock(&dev->struct_mutex);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
@@ -878,6 +898,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 {
 	struct drm_i915_gem_context_destroy *args = data;
 	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_context *ctx;
 	int ret;
 
@@ -894,7 +915,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return PTR_ERR(ctx);
 	}
 
-	idr_remove(&ctx->file_priv->context_idr, ctx->user_handle);
+	idr_remove(&dev_priv->context_idr, ctx->user_handle);
 	i915_gem_context_unreference(ctx);
 	mutex_unlock(&dev->struct_mutex);
 
-- 
1.8.5.1


* [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
  2015-06-22  9:50 ` [RFC 1/8] drm/i915: Have globally unique context ids, as opposed to drm file specific sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22 15:59   ` Daniel Vetter
  2015-06-22  9:50 ` [RFC 3/8] drm/i915: Add the data structures for async OA capture mode sourab.gupta
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

The perf event framework supports periodic capture of OA counter snapshots. The
raw OA reports generated by HW are forwarded to userspace using the perf APIs.
This patch looks to extend the perf pmu introduced earlier to support the
capture of asynchronous OA snapshots (in addition to periodic snapshots). These
asynchronous snapshots will be forwarded by the perf pmu to userspace alongside
periodic snapshots in the same perf ringbuffer.
This patch introduces fields for enabling the asynchronous capture mode of OA
counter snapshots.

There may be usecases which need more than the periodic OA capture mode
currently supported by perf_event. We may need to insert commands into
the ring to dump the OA counter snapshots at asynchronous points during
workload execution.
This mode primarily serves two usecases:
    - Ability to capture system wide metrics. The reports captured should be
      mappable back to individual contexts.
    - Ability to inject tags for work into the reports. This provides
      visibility into the multiple stages of work within a single context.

The asynchronous reports generated in this way (using MI_REPORT_PERF_COUNT
commands) will be forwarded to userspace after appending a footer, which will
hold this metadata information. This will enable the usecases mentioned above.

This may also be seen as a way to overcome a limitation of Haswell, which
doesn't write out a context ID with reports. Handling this in the kernel also
keeps us compatible with Broadwell, which does include a context id in its
reports.

This patch introduces an additional field in the oa attr structure
for supporting this type of capture. The data thus captured needs to be stored
in a separate buffer, different from the buffer used otherwise for the
periodic OA capture mode. Also, this buffer's address does not need to be mapped
to OA unit register addresses such as OASTATUS1, OASTATUS2 and OABUFFER.
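
As a rough illustration of what this field enables, below is a hedged userspace
sketch of opening the event with async capture requested. It assumes the
drm_i915_oa_attr struct is handed over through attr.config as in the earlier
periodic-capture series, that the PMU type is read from sysfs, and that the
format enum name is as guessed here; the exact plumbing may well differ:

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <drm/i915_drm.h>

    static int open_oa_event(int i915_oa_pmu_type)
    {
            drm_i915_oa_attr_t oa_attr;
            struct perf_event_attr attr;

            memset(&oa_attr, 0, sizeof(oa_attr));
            oa_attr.format = I915_OA_FORMAT_A45_B8_C8_HSW; /* assumed enum name */
            oa_attr.metrics_set = I915_OA_METRICS_SET_3D;
            oa_attr.batch_buffer_sample = 1; /* request async (per-BB) snapshots too */

            memset(&attr, 0, sizeof(attr));
            attr.type = i915_oa_pmu_type; /* e.g. from /sys/bus/event_source/devices/i915_oa/type */
            attr.size = sizeof(attr);
            attr.sample_type = PERF_SAMPLE_RAW;
            attr.sample_period = 1; /* non-zero enables sampling; the OA period itself is exponent based */
            attr.config = (uint64_t)(uintptr_t)&oa_attr; /* assumed handover of the oa attr */

            /* system wide (pid = -1); async mode requires CAP_SYS_ADMIN */
            return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
    }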

The subsequent patches in the series introduce the data structures and mechanism
for forwarding reports to userspace, and mechanism for inserting corresponding
commands into the ringbuffer.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  11 ++++
 drivers/gpu/drm/i915/i915_oa_perf.c | 122 ++++++++++++++++++++++++++++++------
 include/uapi/drm/i915_drm.h         |   4 +-
 3 files changed, 118 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index baa0234..ee4a5d3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1933,6 +1933,7 @@ struct drm_i915_private {
 		u32 period_exponent;
 
 		u32 metrics_set;
+		bool async_sample_mode;
 
 		struct {
 			struct drm_i915_gem_object *obj;
@@ -1944,6 +1945,16 @@ struct drm_i915_private {
 			int format_size;
 			spinlock_t flush_lock;
 		} oa_buffer;
+
+		/* Fields for asynchronous OA snapshot capture */
+		struct {
+			struct drm_i915_gem_object *obj;
+			u8 *addr;
+			u32 head;
+			u32 tail;
+			int format;
+			int format_size;
+		} oa_async_buffer;
 	} oa_pmu;
 #endif
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e7e0b2b..419b6a5 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -166,6 +166,20 @@ static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
 }
 
 static void
+oa_async_buffer_destroy(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->oa_pmu.oa_async_buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_async_buffer.obj);
+	drm_gem_object_unreference(&i915->oa_pmu.oa_async_buffer.obj->base);
+
+	i915->oa_pmu.oa_async_buffer.obj = NULL;
+	i915->oa_pmu.oa_async_buffer.addr = NULL;
+	mutex_unlock(&i915->dev->struct_mutex);
+}
+
+static void
 oa_buffer_destroy(struct drm_i915_private *i915)
 {
 	mutex_lock(&i915->dev->struct_mutex);
@@ -207,6 +221,9 @@ static void i915_oa_event_destroy(struct perf_event *event)
 	I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
 				      ~GT_NOA_ENABLE));
 
+	if (dev_priv->oa_pmu.async_sample_mode)
+		oa_async_buffer_destroy(dev_priv);
+
 	oa_buffer_destroy(dev_priv);
 
 	BUG_ON(dev_priv->oa_pmu.exclusive_event != event);
@@ -247,21 +264,11 @@ finish:
 	return addr;
 }
 
-static int init_oa_buffer(struct perf_event *event)
+static int alloc_oa_obj(struct drm_i915_private *dev_priv,
+				struct drm_i915_gem_object **obj)
 {
-	struct drm_i915_private *dev_priv =
-		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
 	struct drm_i915_gem_object *bo;
-	int ret;
-
-	BUG_ON(!IS_HASWELL(dev_priv->dev));
-	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
-
-	ret = i915_mutex_lock_interruptible(dev_priv->dev);
-	if (ret)
-		return ret;
-
-	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
+	int ret = 0;
 
 	/* NB: We over allocate the OA buffer due to the way raw sample data
 	 * gets copied from the gpu mapped circular buffer into the perf
@@ -277,13 +284,13 @@ static int init_oa_buffer(struct perf_event *event)
 	 * when a report is at the end of the gpu mapped buffer we need to
 	 * read 4 bytes past the end of the buffer.
 	 */
+	intel_runtime_pm_get(dev_priv);
 	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
 	if (bo == NULL) {
 		DRM_ERROR("Failed to allocate OA buffer\n");
 		ret = -ENOMEM;
-		goto unlock;
+		goto out;
 	}
-	dev_priv->oa_pmu.oa_buffer.obj = bo;
 
 	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
 	if (ret)
@@ -294,6 +301,38 @@ static int init_oa_buffer(struct perf_event *event)
 	if (ret)
 		goto err_unref;
 
+	*obj = bo;
+	goto out;
+
+err_unref:
+	drm_gem_object_unreference_unlocked(&bo->base);
+out:
+	intel_runtime_pm_put(dev_priv);
+	return ret;
+}
+
+static int init_oa_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev));
+	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return ret;
+
+	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
+
+	ret = alloc_oa_obj(dev_priv, &bo);
+	if (ret)
+		goto unlock;
+
+	dev_priv->oa_pmu.oa_buffer.obj = bo;
+
 	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
 	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
 
@@ -309,10 +348,35 @@ static int init_oa_buffer(struct perf_event *event)
 			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
 			 dev_priv->oa_pmu.oa_buffer.addr);
 
-	goto unlock;
+unlock:
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	return ret;
+}
 
-err_unref:
-	drm_gem_object_unreference(&bo->base);
+static int init_async_oa_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev));
+	BUG_ON(dev_priv->oa_pmu.oa_async_buffer.obj);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return ret;
+
+	ret = alloc_oa_obj(dev_priv, &bo);
+	if (ret)
+		goto unlock;
+
+	dev_priv->oa_pmu.oa_async_buffer.obj = bo;
+
+	dev_priv->oa_pmu.oa_async_buffer.addr = vmap_oa_buffer(bo);
+
+	DRM_DEBUG_DRIVER("OA Async Buffer initialized, vaddr = %p",
+			 dev_priv->oa_pmu.oa_async_buffer.addr);
 
 unlock:
 	mutex_unlock(&dev_priv->dev->struct_mutex);
@@ -444,6 +508,9 @@ static int i915_oa_event_init(struct perf_event *event)
 
 	report_format = oa_attr.format;
 	dev_priv->oa_pmu.oa_buffer.format = report_format;
+	if (oa_attr.batch_buffer_sample)
+		dev_priv->oa_pmu.oa_async_buffer.format = report_format;
+
 	dev_priv->oa_pmu.metrics_set = oa_attr.metrics_set;
 
 	if (IS_HASWELL(dev_priv->dev)) {
@@ -457,6 +524,9 @@ static int i915_oa_event_init(struct perf_event *event)
 			return -EINVAL;
 
 		dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
+		if (oa_attr.batch_buffer_sample)
+			dev_priv->oa_pmu.oa_async_buffer.format_size =
+				snapshot_size;
 
 		if (oa_attr.metrics_set > I915_OA_METRICS_SET_MAX)
 			return -EINVAL;
@@ -465,6 +535,16 @@ static int i915_oa_event_init(struct perf_event *event)
 		return -ENODEV;
 	}
 
+	/*
+	 * In case of per batch buffer sampling, we need to check for
+	 * CAP_SYS_ADMIN capability as we profile all the running contexts
+	 */
+	if (oa_attr.batch_buffer_sample) {
+		if (!capable(CAP_SYS_ADMIN))
+			return -EACCES;
+		dev_priv->oa_pmu.async_sample_mode = true;
+	}
+
 	/* Since we are limited to an exponential scale for
 	 * programming the OA sampling period we don't allow userspace
 	 * to pass a precise attr.sample_period. */
@@ -528,6 +608,12 @@ static int i915_oa_event_init(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (oa_attr.batch_buffer_sample) {
+		ret = init_async_oa_buffer(event);
+		if (ret)
+			return ret;
+	}
+
 	BUG_ON(dev_priv->oa_pmu.exclusive_event);
 	dev_priv->oa_pmu.exclusive_event = event;
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 992e1e9..354dc3a 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -92,7 +92,9 @@ typedef struct _drm_i915_oa_attr {
 	__u32 ctx_id;
 
 	__u64 single_context : 1,
-	      __reserved_1 : 63;
+	__reserved_1 : 31;
+	__u32 batch_buffer_sample:1,
+	__reserved_2:31;
 } drm_i915_oa_attr_t;
 
 /* Header for PERF_RECORD_DEVICE type events */
-- 
1.8.5.1


* [RFC 3/8] drm/i915: Add the data structures for async OA capture mode
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
  2015-06-22  9:50 ` [RFC 1/8] drm/i915: Have globally unique context ids, as opposed to drm file specific sourab.gupta
  2015-06-22  9:50 ` [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22 16:01   ` Daniel Vetter
  2015-06-22  9:50 ` [RFC 4/8] drm/i915: Add mechanism for forwarding async OA counter snapshots through perf sourab.gupta
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces the data structures for capturing asynchronous OA
snapshots.

The data captured will be organized into nodes. Each node has a field for the
OA report along with metadata information such as ctx_id, pid, etc. The
metadata information can be extended to provide any additional information.
The data is organized to have a queue header at the beginning, which holds
information about the size, data offset, number of nodes captured, etc.
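
For illustration, a small helper showing how a node is located from the queue
header in this layout (it mirrors the first_node/num_nodes arithmetic used by
the later patches in the series; the helper itself is only a sketch, not part
of the patch):

    /* Layout: [queue header][pad up to data_offset][node 0][node 1]... */
    static struct drm_i915_oa_async_node *
    oa_async_node(struct drm_i915_oa_async_queue_header *hdr, u32 i)
    {
            struct drm_i915_oa_async_node *first =
                    (struct drm_i915_oa_async_node *)
                    ((char *)hdr + hdr->data_offset);
            u32 num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
                            sizeof(*first);

            /* node_count indexes this area as a ring, modulo its capacity */
            return &first[i % num_nodes];
    }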

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 21 +++++++++++++++++++++
 include/uapi/drm/i915_drm.h     |  5 +++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ee4a5d3..da150bc 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1677,6 +1677,27 @@ extern const struct i915_oa_reg i915_oa_sampler_balance_mux_config_hsw[];
 extern const int i915_oa_sampler_balance_mux_config_hsw_len;
 extern const struct i915_oa_reg i915_oa_sampler_balance_b_counter_config_hsw[];
 extern const int i915_oa_sampler_balance_b_counter_config_hsw_len;
+
+
+struct drm_i915_oa_async_queue_header {
+	__u64 size_in_bytes;
+	/* Byte offset, start of queue header to first node */
+	__u64 data_offset;
+	__u32 node_count;
+	__u32 wrap_count;
+	__u32 pad[10];
+};
+
+struct drm_i915_oa_async_node_info {
+	__u32 pid;
+	__u32 ctx_id;
+	__u32 pad[14];
+};
+
+struct drm_i915_oa_async_node {
+	struct drm_i915_oa_async_node_info node_info;
+	__u32 report_perf[64]; /* Must be aligned to 64-byte boundary */
+};
 #endif
 
 struct drm_i915_private {
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 354dc3a..c91b427 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -124,6 +124,11 @@ enum drm_i915_oa_event_type {
 	I915_OA_RECORD_MAX,			/* non-ABI */
 };
 
+struct drm_i915_oa_async_node_footer {
+	__u32 pid;
+	__u32 ctx_id;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1


* [RFC 4/8] drm/i915: Add mechanism for forwarding async OA counter snapshots through perf
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
                   ` (2 preceding siblings ...)
  2015-06-22  9:50 ` [RFC 3/8] drm/i915: Add the data structures for async OA capture mode sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22  9:50 ` [RFC 5/8] drm/i915: Wait for GPU to finish before event stop, in async OA counter mode sourab.gupta
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds the mechanism for forwarding the asynchronous OA snapshots
through the perf event interface.

Each node of data collected is forwarded as a separate perf sample.
A single snapshot has two fields: the first is the raw report and the second
is a footer with metadata corresponding to the snapshot, such as ctx_id and pid.
The size of the raw report is the one specified during event init.
The samples are forwarded from a workqueue, which is scheduled when the hrtimer
triggers; the workqueue walks the collected nodes and emits one perf sample per
node.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |   5 +-
 drivers/gpu/drm/i915/i915_oa_perf.c | 158 +++++++++++++++++++++++++++++++++++-
 2 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index da150bc..d738f7a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1691,7 +1691,8 @@ struct drm_i915_oa_async_queue_header {
 struct drm_i915_oa_async_node_info {
 	__u32 pid;
 	__u32 ctx_id;
-	__u32 pad[14];
+	struct drm_i915_gem_request *req;
+	__u32 pad[12];
 };
 
 struct drm_i915_oa_async_node {
@@ -1975,7 +1976,9 @@ struct drm_i915_private {
 			u32 tail;
 			int format;
 			int format_size;
+			u8 *snapshot;
 		} oa_async_buffer;
+		struct work_struct work_timer;
 	} oa_pmu;
 #endif
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 419b6a5..3bf4c47 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -25,6 +25,128 @@ static int hsw_perf_format_sizes[] = {
 	64   /* C4_B8_HSW */
 };
 
+static void init_oa_async_buf_queue(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_oa_async_queue_header *hdr =
+		(struct drm_i915_oa_async_queue_header *)
+		dev_priv->oa_pmu.oa_async_buffer.addr;
+	void *data_ptr;
+
+	hdr->size_in_bytes = dev_priv->oa_pmu.oa_async_buffer.obj->base.size;
+	/* 64 bit alignment for OA node address */
+	data_ptr = PTR_ALIGN((void *)(hdr + 1), 64);
+	hdr->data_offset = (__u64)(data_ptr - (void *)hdr);
+
+	hdr->node_count = 0;
+	hdr->wrap_count = 0;
+}
+
+static void forward_one_oa_async_sample(struct drm_i915_private *dev_priv,
+				struct drm_i915_oa_async_node *node)
+{
+	struct perf_sample_data data;
+	struct perf_event *event = dev_priv->oa_pmu.exclusive_event;
+	int format_size, snapshot_size;
+	u8 *snapshot;
+	struct perf_raw_record raw;
+
+	format_size = dev_priv->oa_pmu.oa_async_buffer.format_size;
+	snapshot_size = format_size +
+			sizeof(struct drm_i915_oa_async_node_footer);
+	snapshot = dev_priv->oa_pmu.oa_async_buffer.snapshot;
+
+	memcpy(snapshot, node, format_size);
+	memcpy(snapshot + format_size, &node->node_info,
+			sizeof(struct drm_i915_oa_async_node_footer));
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	/* Note: the combined u32 raw->size member + raw data itself must be 8
+	 * byte aligned. (See note in init_oa_buffer for more details) */
+	raw.size = snapshot_size + 4;
+	raw.data = snapshot;
+
+	data.raw = &raw;
+
+	perf_event_overflow(event, &data, &dev_priv->oa_pmu.dummy_regs);
+}
+
+void i915_oa_async_wait_gpu(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_oa_async_queue_header *hdr =
+		(struct drm_i915_oa_async_queue_header *)
+		dev_priv->oa_pmu.oa_async_buffer.addr;
+	struct drm_i915_oa_async_node *first_node, *node;
+	int ret, head, tail, num_nodes;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_oa_async_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+
+	tail = hdr->node_count;
+	head = dev_priv->oa_pmu.oa_async_buffer.head;
+
+	/* wait for all requests to complete*/
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req) {
+			if (!i915_gem_request_completed(req, true)) {
+				ret = i915_wait_request(req);
+				if (ret)
+					DRM_DEBUG_DRIVER(
+					"oa async: failed to wait\n");
+			}
+			i915_gem_request_assign(&node->node_info.req, NULL);
+		}
+		head++;
+	}
+}
+
+void forward_oa_async_snapshots_work(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			oa_pmu.work_timer);
+	struct drm_i915_oa_async_queue_header *hdr =
+		(struct drm_i915_oa_async_queue_header *)
+		dev_priv->oa_pmu.oa_async_buffer.addr;
+	struct drm_i915_oa_async_node *first_node, *node;
+	int ret, head, tail, num_nodes;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_oa_async_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	tail = hdr->node_count;
+	head = dev_priv->oa_pmu.oa_async_buffer.head;
+
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && i915_gem_request_completed(req, true)) {
+			forward_one_oa_async_sample(dev_priv, node);
+			i915_gem_request_assign(&node->node_info.req, NULL);
+			head++;
+		} else
+			break;
+	}
+
+	dev_priv->oa_pmu.oa_async_buffer.tail = tail;
+	dev_priv->oa_pmu.oa_async_buffer.head = head;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+}
+
 static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
 					     u8 *snapshot,
 					     struct perf_event *event)
@@ -58,6 +180,14 @@ static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv,
 	u8 *snapshot;
 	u32 taken;
 
+	/*
+	 * Schedule a wq to forward the async samples collected. We schedule
+	 * wq here, since it requires device mutex to be taken which can't be
+	 * done here because of atomic context
+	 */
+	if (dev_priv->oa_pmu.async_sample_mode)
+		schedule_work(&dev_priv->oa_pmu.work_timer);
+
 	head -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
 	tail -= dev_priv->oa_pmu.oa_buffer.gtt_offset;
 
@@ -176,6 +306,8 @@ oa_async_buffer_destroy(struct drm_i915_private *i915)
 
 	i915->oa_pmu.oa_async_buffer.obj = NULL;
 	i915->oa_pmu.oa_async_buffer.addr = NULL;
+	kfree(i915->oa_pmu.oa_async_buffer.snapshot);
+
 	mutex_unlock(&i915->dev->struct_mutex);
 }
 
@@ -358,7 +490,7 @@ static int init_async_oa_buffer(struct perf_event *event)
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
 	struct drm_i915_gem_object *bo;
-	int ret;
+	int snapshot_size, ret;
 
 	BUG_ON(!IS_HASWELL(dev_priv->dev));
 	BUG_ON(dev_priv->oa_pmu.oa_async_buffer.obj);
@@ -374,6 +506,12 @@ static int init_async_oa_buffer(struct perf_event *event)
 	dev_priv->oa_pmu.oa_async_buffer.obj = bo;
 
 	dev_priv->oa_pmu.oa_async_buffer.addr = vmap_oa_buffer(bo);
+	init_oa_async_buf_queue(dev_priv);
+
+	snapshot_size = dev_priv->oa_pmu.oa_async_buffer.format_size +
+				sizeof(struct drm_i915_oa_async_node_footer);
+	dev_priv->oa_pmu.oa_async_buffer.snapshot =
+			kmalloc(snapshot_size, GFP_KERNEL);
 
 	DRM_DEBUG_DRIVER("OA Async Buffer initialized, vaddr = %p",
 			 dev_priv->oa_pmu.oa_async_buffer.addr);
@@ -814,6 +952,11 @@ static void i915_oa_event_stop(struct perf_event *event, int flags)
 		flush_oa_snapshots(dev_priv, false);
 	}
 
+	if (dev_priv->oa_pmu.async_sample_mode) {
+		dev_priv->oa_pmu.oa_async_buffer.tail = 0;
+		dev_priv->oa_pmu.oa_async_buffer.head = 0;
+	}
+
 	event->hw.state = PERF_HES_STOPPED;
 }
 
@@ -844,7 +987,15 @@ static int i915_oa_event_flush(struct perf_event *event)
 	if (event->attr.sample_period) {
 		struct drm_i915_private *i915 =
 			container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+		int ret;
 
+		if (i915->oa_pmu.async_sample_mode) {
+			ret = i915_mutex_lock_interruptible(i915->dev);
+			if (ret)
+				return ret;
+			i915_oa_async_wait_gpu(i915);
+			mutex_unlock(&i915->dev->struct_mutex);
+		}
 		flush_oa_snapshots(i915, true);
 	}
 
@@ -940,6 +1091,8 @@ void i915_oa_pmu_register(struct drm_device *dev)
 	hrtimer_init(&i915->oa_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	i915->oa_pmu.timer.function = hrtimer_sample;
 
+	INIT_WORK(&i915->oa_pmu.work_timer, forward_oa_async_snapshots_work);
+
 	spin_lock_init(&i915->oa_pmu.lock);
 
 	i915->oa_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -969,6 +1122,9 @@ void i915_oa_pmu_unregister(struct drm_device *dev)
 	if (i915->oa_pmu.pmu.event_init == NULL)
 		return;
 
+	if (i915->oa_pmu.async_sample_mode)
+		cancel_work_sync(&i915->oa_pmu.work_timer);
+
 	unregister_sysctl_table(i915->oa_pmu.sysctl_header);
 
 	perf_pmu_unregister(&i915->oa_pmu.pmu);
-- 
1.8.5.1


* [RFC 5/8] drm/i915: Wait for GPU to finish before event stop, in async OA counter mode
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
                   ` (3 preceding siblings ...)
  2015-06-22  9:50 ` [RFC 4/8] drm/i915: Add mechanism for forwarding async OA counter snapshots through perf sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22  9:50 ` [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer sourab.gupta
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

The asynchronous OA counter snapshot collection mode needs insertion of
MI_REPORT_PERF_COUNT commands into the ringbuffer. Therefore, during the event
stop call, we need to wait for the GPU to finish processing the last request
for which an MI_RPC command was inserted. We need to ensure this processing is
completed before the event_destroy callback, which deallocates the buffer.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 95 ++++++++++++++++++++++++++++++-------
 2 files changed, 81 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d738f7a..5453842 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1979,6 +1979,8 @@ struct drm_i915_private {
 			u8 *snapshot;
 		} oa_async_buffer;
 		struct work_struct work_timer;
+		struct work_struct work_event_stop;
+		struct completion complete;
 	} oa_pmu;
 #endif
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 3bf4c47..5d63dab 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -118,6 +118,9 @@ void forward_oa_async_snapshots_work(struct work_struct *__work)
 	int ret, head, tail, num_nodes;
 	struct drm_i915_gem_request *req;
 
+	if (dev_priv->oa_pmu.event_active == false)
+		return;
+
 	first_node = (struct drm_i915_oa_async_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
@@ -298,6 +301,7 @@ static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
 static void
 oa_async_buffer_destroy(struct drm_i915_private *i915)
 {
+	wait_for_completion(&i915->oa_pmu.complete);
 	mutex_lock(&i915->dev->struct_mutex);
 
 	vunmap(i915->oa_pmu.oa_async_buffer.addr);
@@ -854,6 +858,63 @@ static void config_oa_regs(struct drm_i915_private *dev_priv,
 	}
 }
 
+
+void i915_oa_async_stop_work_fn(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			oa_pmu.work_event_stop);
+	struct perf_event *event = dev_priv->oa_pmu.exclusive_event;
+	struct drm_i915_oa_async_queue_header *hdr =
+		(struct drm_i915_oa_async_queue_header *)
+		dev_priv->oa_pmu.oa_async_buffer.addr;
+	struct drm_i915_oa_async_node *first_node, *node;
+	struct drm_i915_gem_request *req;
+	int ret, head, tail, num_nodes;
+
+	first_node = (struct drm_i915_oa_async_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	dev_priv->oa_pmu.event_active = false;
+
+	i915_oa_async_wait_gpu(dev_priv);
+
+	update_oacontrol(dev_priv);
+	mmiowb();
+
+	/* Ensure that all requests are completed*/
+	tail = hdr->node_count;
+	head = dev_priv->oa_pmu.oa_async_buffer.head;
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && !i915_gem_request_completed(req, true))
+			WARN_ON(1);
+		head++;
+	}
+
+	if (event->attr.sample_period) {
+		hrtimer_cancel(&dev_priv->oa_pmu.timer);
+		flush_oa_snapshots(dev_priv, false);
+	}
+	cancel_work_sync(&dev_priv->oa_pmu.work_timer);
+
+	dev_priv->oa_pmu.oa_async_buffer.tail = 0;
+	dev_priv->oa_pmu.oa_async_buffer.head = 0;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	event->hw.state = PERF_HES_STOPPED;
+	complete(&dev_priv->oa_pmu.complete);
+}
+
 static void i915_oa_event_start(struct perf_event *event, int flags)
 {
 	struct drm_i915_private *dev_priv =
@@ -939,25 +1000,23 @@ static void i915_oa_event_stop(struct perf_event *event, int flags)
 		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
 	unsigned long lock_flags;
 
-	spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
-
-	dev_priv->oa_pmu.event_active = false;
-	update_oacontrol(dev_priv);
-
-	mmiowb();
-	spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+	if (dev_priv->oa_pmu.async_sample_mode)
+		schedule_work(&dev_priv->oa_pmu.work_event_stop);
+	else {
+		spin_lock_irqsave(&dev_priv->oa_pmu.lock, lock_flags);
+		dev_priv->oa_pmu.event_active = false;
+		update_oacontrol(dev_priv);
 
-	if (event->attr.sample_period) {
-		hrtimer_cancel(&dev_priv->oa_pmu.timer);
-		flush_oa_snapshots(dev_priv, false);
-	}
+		mmiowb();
+		spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, lock_flags);
+		if (event->attr.sample_period) {
+			hrtimer_cancel(&dev_priv->oa_pmu.timer);
+			flush_oa_snapshots(dev_priv, false);
+		}
 
-	if (dev_priv->oa_pmu.async_sample_mode) {
-		dev_priv->oa_pmu.oa_async_buffer.tail = 0;
-		dev_priv->oa_pmu.oa_async_buffer.head = 0;
+		event->hw.state = PERF_HES_STOPPED;
 	}
 
-	event->hw.state = PERF_HES_STOPPED;
 }
 
 static int i915_oa_event_add(struct perf_event *event, int flags)
@@ -1092,6 +1151,8 @@ void i915_oa_pmu_register(struct drm_device *dev)
 	i915->oa_pmu.timer.function = hrtimer_sample;
 
 	INIT_WORK(&i915->oa_pmu.work_timer, forward_oa_async_snapshots_work);
+	INIT_WORK(&i915->oa_pmu.work_event_stop, i915_oa_async_stop_work_fn);
+	init_completion(&i915->oa_pmu.complete);
 
 	spin_lock_init(&i915->oa_pmu.lock);
 
@@ -1122,8 +1183,10 @@ void i915_oa_pmu_unregister(struct drm_device *dev)
 	if (i915->oa_pmu.pmu.event_init == NULL)
 		return;
 
-	if (i915->oa_pmu.async_sample_mode)
+	if (i915->oa_pmu.async_sample_mode) {
 		cancel_work_sync(&i915->oa_pmu.work_timer);
+		cancel_work_sync(&i915->oa_pmu.work_event_stop);
+	}
 
 	unregister_sysctl_table(i915->oa_pmu.sysctl_header);
 
-- 
1.8.5.1


* [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
                   ` (4 preceding siblings ...)
  2015-06-22  9:50 ` [RFC 5/8] drm/i915: Wait for GPU to finish before event stop, in async OA counter mode sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22 15:55   ` Daniel Vetter
  2015-06-22  9:50 ` [RFC 7/8] drm/i915: Add commands in ringbuf for OA snapshot capture across Batchbuffer boundaries sourab.gupta
  2015-06-22  9:50 ` [RFC 8/8] drm/i915: Add perfTag support for OA counter reports sourab.gupta
  7 siblings, 1 reply; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces the routines which insert commands for capturing OA
snapshots into the ringbuffer for the RCS engine.
The MI_REPORT_PERF_COUNT command can be used to capture snapshots of OA
counters. The routines introduced in this patch can be called to insert these
commands at appropriate points during workload submission.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c     |  1 +
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 86 +++++++++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 0553f20..f12feaa 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -821,6 +821,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	/* Must at least be registered before trying to pin any context
 	 * otherwise i915_oa_context_pin_notify() will lock an un-initialized
 	 * spinlock, upsetting lockdep checks */
+	INIT_LIST_HEAD(&dev_priv->profile_cmd);
 	i915_oa_pmu_register(dev);
 
 	intel_pm_setup(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5453842..798da49 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1982,6 +1982,8 @@ struct drm_i915_private {
 		struct work_struct work_event_stop;
 		struct completion complete;
 	} oa_pmu;
+
+	struct list_head profile_cmd;
 #endif
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
@@ -3162,6 +3164,7 @@ void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context);
 void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
 				  struct intel_context *context);
+void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id);
 #else
 static inline void
 i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 5d63dab..b02850c 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -25,6 +25,76 @@ static int hsw_perf_format_sizes[] = {
 	64   /* C4_B8_HSW */
 };
 
+struct drm_i915_insert_cmd {
+	struct list_head list;
+	void (*insert_cmd)(struct intel_ringbuffer *ringbuf, u32 ctx_id);
+};
+
+void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
+{
+	struct intel_engine_cs *ring = ringbuf->ring;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_insert_cmd *entry;
+
+	list_for_each_entry(entry, &dev_priv->profile_cmd, list)
+		entry->insert_cmd(ringbuf, ctx_id);
+}
+
+void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
+{
+	struct intel_engine_cs *ring = ringbuf->ring;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_oa_async_node_info *node_info = NULL;
+	struct drm_i915_oa_async_queue_header *queue_hdr =
+			(struct drm_i915_oa_async_queue_header *)
+			dev_priv->oa_pmu.oa_async_buffer.addr;
+	void *data_ptr = (u8 *)queue_hdr + queue_hdr->data_offset;
+	int data_size =	(queue_hdr->size_in_bytes - queue_hdr->data_offset);
+	u32 data_offset, addr = 0;
+	int ret;
+
+	struct drm_i915_oa_async_node *nodes = data_ptr;
+	int num_nodes = 0;
+	int index = 0;
+
+	/* OA counters are only supported on the render ring */
+	if (ring->id != RCS)
+		return;
+
+	num_nodes = data_size / sizeof(*nodes);
+	index = queue_hdr->node_count % num_nodes;
+
+	data_offset = offsetof(struct drm_i915_oa_async_node, report_perf);
+
+	addr = i915_gem_obj_ggtt_offset(dev_priv->oa_pmu.oa_async_buffer.obj) +
+		queue_hdr->data_offset +
+		index * sizeof(struct drm_i915_oa_async_node) +
+		data_offset;
+
+	/* addr should be 64 byte aligned */
+	BUG_ON(addr & 0x3f);
+
+	ret = intel_ring_begin(ring, 4);
+	if (ret)
+		return;
+
+	intel_ring_emit(ring, MI_REPORT_PERF_COUNT | (1<<0));
+	intel_ring_emit(ring, addr | MI_REPORT_PERF_COUNT_GGTT);
+	intel_ring_emit(ring, ring->outstanding_lazy_request->seqno);
+	intel_ring_emit(ring, MI_NOOP);
+	intel_ring_advance(ring);
+
+	node_info = &nodes[index].node_info;
+	i915_gem_request_assign(&node_info->req,
+				ring->outstanding_lazy_request);
+
+	node_info->pid = current->pid;
+	node_info->ctx_id = ctx_id;
+	queue_hdr->node_count++;
+	if (queue_hdr->node_count > num_nodes)
+		queue_hdr->wrap_count++;
+}
+
 static void init_oa_async_buf_queue(struct drm_i915_private *dev_priv)
 {
 	struct drm_i915_oa_async_queue_header *hdr =
@@ -865,6 +935,7 @@ void i915_oa_async_stop_work_fn(struct work_struct *__work)
 		container_of(__work, typeof(*dev_priv),
 			oa_pmu.work_event_stop);
 	struct perf_event *event = dev_priv->oa_pmu.exclusive_event;
+	struct drm_i915_insert_cmd *entry, *next;
 	struct drm_i915_oa_async_queue_header *hdr =
 		(struct drm_i915_oa_async_queue_header *)
 		dev_priv->oa_pmu.oa_async_buffer.addr;
@@ -882,6 +953,13 @@ void i915_oa_async_stop_work_fn(struct work_struct *__work)
 	if (ret)
 		return;
 
+	list_for_each_entry_safe(entry, next, &dev_priv->profile_cmd, list) {
+		if (entry->insert_cmd == i915_oa_insert_cmd) {
+			list_del(&entry->list);
+			kfree(entry);
+		}
+	}
+
 	dev_priv->oa_pmu.event_active = false;
 
 	i915_oa_async_wait_gpu(dev_priv);
@@ -920,8 +998,14 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
 	unsigned long lock_flags;
+	struct drm_i915_insert_cmd *entry;
 	u32 oastatus1, tail;
 
+	entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+	if (!entry)
+		return;
+	entry->insert_cmd = i915_oa_insert_cmd;
+
 	if (dev_priv->oa_pmu.metrics_set == I915_OA_METRICS_SET_3D) {
 		config_oa_regs(dev_priv, i915_oa_3d_mux_config_hsw,
 				i915_oa_3d_mux_config_hsw_len);
@@ -976,6 +1060,8 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
 	dev_priv->oa_pmu.event_active = true;
 	update_oacontrol(dev_priv);
 
+	list_add_tail(&entry->list, &dev_priv->profile_cmd);
+
 	/* Reset the head ptr to ensure we don't forward reports relating
 	 * to a previous perf event */
 	oastatus1 = I915_READ(GEN7_OASTATUS1);
-- 
1.8.5.1


* [RFC 7/8] drm/i915: Add commands in ringbuf for OA snapshot capture across Batchbuffer boundaries
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
                   ` (5 preceding siblings ...)
  2015-06-22  9:50 ` [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  2015-06-22  9:50 ` [RFC 8/8] drm/i915: Add perfTag support for OA counter reports sourab.gupta
  7 siblings, 0 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch inserts the commands in the ring for capturing OA snapshots across
batchbuffer boundaries. The data generated this way is of batchbuffer
granularity, and can be useful standalone for per batch buffer profiling
purposes. The issue of counter wraparound for large batch buffers can be
mitigated by using this data in conjunction with the periodic OA sample data,
which is generated alongside the per-BB snapshot data.
Such data gives userspace tools useful information for analyzing
batchbuffer specific performance and timing.
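
A hedged sketch of the accumulation scheme alluded to above: folding in the
periodic samples taken between the per-BB start and end snapshots keeps each
step small enough that unsigned 32-bit arithmetic absorbs at most one wrap
(the counter width is an assumption; only the arithmetic is the point here):

    #include <stdint.h>

    /* Add the delta between two successive raw 32-bit counter values to a
     * 64-bit running total; unsigned subtraction handles a single wrap per step. */
    static uint64_t accumulate_counter(uint32_t prev, uint32_t cur, uint64_t total)
    {
            return total + (uint32_t)(cur - prev);
    }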

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 3336e1c..f5a2308 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1318,6 +1318,10 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 	}
 
 	exec_len = args->batch_len;
+
+	i915_insert_profiling_cmd(ring->buffer,
+			i915_execbuffer2_get_context_id(*args));
+
 	if (cliprects) {
 		for (i = 0; i < args->num_cliprects; i++) {
 			ret = i915_emit_box(ring, &cliprects[i],
@@ -1339,6 +1343,9 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 			return ret;
 	}
 
+	i915_insert_profiling_cmd(ring->buffer,
+			i915_execbuffer2_get_context_id(*args));
+
 	trace_i915_gem_ring_dispatch(intel_ring_get_request(ring), dispatch_flags);
 
 	i915_gem_execbuffer_move_to_active(vmas, ring);
-- 
1.8.5.1


* [RFC 8/8] drm/i915: Add perfTag support for OA counter reports
  2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
                   ` (6 preceding siblings ...)
  2015-06-22  9:50 ` [RFC 7/8] drm/i915: Add commands in ringbuf for OA snapshot capture across Batchbuffer boundaries sourab.gupta
@ 2015-06-22  9:50 ` sourab.gupta
  7 siblings, 0 replies; 12+ messages in thread
From: sourab.gupta @ 2015-06-22  9:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch enables collection of a perfTag in the OA reports.

PerfTag is a mechanism whereby the collected reports are marked with a tag
passed by userspace during the execbuffer call. This lets userspace associate
the reports with the particular execbuffers that produced them, which is
especially useful for identifying the individual stages of work within a
single context and attributing reports to those stages.

In this patch, the rsvd2 field of the execbuffer arguments is used to pass in
the perfTag. A new flag bit in the execbuffer flags informs the kernel that a
perfTag is being passed in the execbuffer arguments.
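As a rough illustration of the userspace side (a sketch only; the exec object
list, fd, context id and tag values are assumptions, and error handling is
omitted), an execbuffer call carrying a perfTag might look like:

#include <string.h>
#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Submit a batch on the render ring, tagged with 'perf_tag'; the exec
 * object list and batch length are assumed to be prepared elsewhere.
 */
static void submit_tagged_batch(int fd, uint32_t ctx_id, uint32_t perf_tag,
				struct drm_i915_gem_exec_object2 *objects,
				unsigned int nr_objects, uint32_t batch_len)
{
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objects;
	execbuf.buffer_count = nr_objects;
	execbuf.batch_len = batch_len;
	execbuf.flags = I915_EXEC_RENDER | I915_EXEC_PERFTAG;
	i915_execbuffer2_set_context_id(execbuf, ctx_id);
	execbuf.rsvd2 = perf_tag; /* read in the kernel via i915_execbuffer2_get_perftag() */

	drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}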

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  7 +++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++--
 drivers/gpu/drm/i915/i915_oa_perf.c        | 12 ++++++++----
 include/uapi/drm/i915_drm.h                | 15 +++++++++++++--
 4 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 798da49..758d924 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1691,8 +1691,10 @@ struct drm_i915_oa_async_queue_header {
 struct drm_i915_oa_async_node_info {
 	__u32 pid;
 	__u32 ctx_id;
+	__u32 perftag;
+	__u32 padding;
 	struct drm_i915_gem_request *req;
-	__u32 pad[12];
+	__u32 pad[10];
 };
 
 struct drm_i915_oa_async_node {
@@ -3164,7 +3166,8 @@ void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context);
 void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
 				  struct intel_context *context);
-void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id);
+void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id,
+				int perftag);
 #else
 static inline void
 i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index f5a2308..7be4f6a 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1320,7 +1320,8 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 	exec_len = args->batch_len;
 
 	i915_insert_profiling_cmd(ring->buffer,
-			i915_execbuffer2_get_context_id(*args));
+			i915_execbuffer2_get_context_id(*args),
+			i915_execbuffer2_get_perftag(*args));
 
 	if (cliprects) {
 		for (i = 0; i < args->num_cliprects; i++) {
@@ -1344,7 +1345,8 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 	}
 
 	i915_insert_profiling_cmd(ring->buffer,
-			i915_execbuffer2_get_context_id(*args));
+			i915_execbuffer2_get_context_id(*args),
+			i915_execbuffer2_get_perftag(*args));
 
 	trace_i915_gem_ring_dispatch(intel_ring_get_request(ring), dispatch_flags);
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index b02850c..ab419d9 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -27,20 +27,23 @@ static int hsw_perf_format_sizes[] = {
 
 struct drm_i915_insert_cmd {
 	struct list_head list;
-	void (*insert_cmd)(struct intel_ringbuffer *ringbuf, u32 ctx_id);
+	void (*insert_cmd)(struct intel_ringbuffer *ringbuf, u32 ctx_id,
+				int perftag);
 };
 
-void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
+void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id,
+				int perftag)
 {
 	struct intel_engine_cs *ring = ringbuf->ring;
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	struct drm_i915_insert_cmd *entry;
 
 	list_for_each_entry(entry, &dev_priv->profile_cmd, list)
-		entry->insert_cmd(ringbuf, ctx_id);
+		entry->insert_cmd(ringbuf, ctx_id, perftag);
 }
 
-void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
+void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id,
+			int perftag)
 {
 	struct intel_engine_cs *ring = ringbuf->ring;
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
@@ -90,6 +93,7 @@ void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
 
 	node_info->pid = current->pid;
 	node_info->ctx_id = ctx_id;
+	node_info->perftag = perftag;
 	queue_hdr->node_count++;
 	if (queue_hdr->node_count > num_nodes)
 		queue_hdr->wrap_count++;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index c91b427..4d99992 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -127,6 +127,8 @@ enum drm_i915_oa_event_type {
 struct drm_i915_oa_async_node_footer {
 	__u32 pid;
 	__u32 ctx_id;
+	__u32 perftag;
+	__u32 pad;
 };
 
 /* Each region is a minimum of 16k, and there are at most 255 of them.
@@ -797,7 +799,7 @@ struct drm_i915_gem_execbuffer2 {
 #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
 	__u64 flags;
 	__u64 rsvd1; /* now used for context info */
-	__u64 rsvd2;
+	__u64 rsvd2; /* used for perftag */
 };
 
 /** Resets the SO write offset registers for transform feedback on gen7. */
@@ -835,7 +837,12 @@ struct drm_i915_gem_execbuffer2 {
 #define I915_EXEC_BSD_RING1		(1<<13)
 #define I915_EXEC_BSD_RING2		(2<<13)
 
-#define __I915_EXEC_UNKNOWN_FLAGS -(1<<15)
+/** Inform the kernel that the perftag is passed through rsvd2 field of
+ * execbuffer args
+ */
+#define I915_EXEC_PERFTAG		(1<<15)
+
+#define __I915_EXEC_UNKNOWN_FLAGS -(1<<16)
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
@@ -843,6 +850,10 @@ struct drm_i915_gem_execbuffer2 {
 #define i915_execbuffer2_get_context_id(eb2) \
 	((eb2).rsvd1 & I915_EXEC_CONTEXT_ID_MASK)
 
+#define I915_EXEC_PERFTAG_MASK		(0xffffffff)
+#define i915_execbuffer2_get_perftag(eb2) \
+	((eb2).rsvd2 & I915_EXEC_PERFTAG_MASK)
+
 struct drm_i915_gem_pin {
 	/** Handle of the buffer to be pinned. */
 	__u32 handle;
-- 
1.8.5.1

* Re: [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer
  2015-06-22  9:50 ` [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer sourab.gupta
@ 2015-06-22 15:55   ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:55 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:20:17PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch introduces the routines which insert commands for capturing OA
> snapshots into the ringbuffer for the RCS engine.
> The MI_REPORT_PERF_COUNT command can be used to capture snapshots of the OA
> counters. The routines introduced in this patch can be called to insert these
> commands at appropriate points during workload submission.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c     |  1 +
>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>  drivers/gpu/drm/i915/i915_oa_perf.c | 86 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 90 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 0553f20..f12feaa 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -821,6 +821,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
>  	/* Must at least be registered before trying to pin any context
>  	 * otherwise i915_oa_context_pin_notify() will lock an un-initialized
>  	 * spinlock, upsetting lockdep checks */
> +	INIT_LIST_HEAD(&dev_priv->profile_cmd);
>  	i915_oa_pmu_register(dev);
>  
>  	intel_pm_setup(dev);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 5453842..798da49 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1982,6 +1982,8 @@ struct drm_i915_private {
>  		struct work_struct work_event_stop;
>  		struct completion complete;
>  	} oa_pmu;
> +
> +	struct list_head profile_cmd;

Adding a list for just one entry (or maybe 2-3) is imo total overkill, and all
it achieves is making the code much harder to read. Please remove.
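
Presumably the call site would then just invoke the OA hook directly, roughly
as sketched below (not from the series; guarding on event_active is an
assumption):

void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
{
	struct intel_engine_cs *ring = ringbuf->ring;
	struct drm_i915_private *dev_priv = ring->dev->dev_private;

	/* Call the OA hook directly instead of walking a one-entry list;
	 * checking event_active here is an assumption, not from the patch.
	 */
	if (dev_priv->oa_pmu.event_active)
		i915_oa_insert_cmd(ringbuf, ctx_id);
}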

Also please don't split up patches by first adding dead code which isn't
called anywhere and then a second patch which just adds the 1-2 places that
call the new functions. Understanding the calling context and placement of a
function is very important for review; by splitting things up like that you
force reviewers to read ahead and jump around in the patch series, which is
inefficient.

If the code would otherwise be too large for one patch (not the case here),
a good practice is to first introduce the scaffolding of function calls and
later fill out the guts of each.
-Daniel

>  #endif
>  
>  	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
> @@ -3162,6 +3164,7 @@ void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
>  				struct intel_context *context);
>  void i915_oa_context_unpin_notify(struct drm_i915_private *dev_priv,
>  				  struct intel_context *context);
> +void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id);
>  #else
>  static inline void
>  i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
> diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
> index 5d63dab..b02850c 100644
> --- a/drivers/gpu/drm/i915/i915_oa_perf.c
> +++ b/drivers/gpu/drm/i915/i915_oa_perf.c
> @@ -25,6 +25,76 @@ static int hsw_perf_format_sizes[] = {
>  	64   /* C4_B8_HSW */
>  };
>  
> +struct drm_i915_insert_cmd {
> +	struct list_head list;
> +	void (*insert_cmd)(struct intel_ringbuffer *ringbuf, u32 ctx_id);
> +};
> +
> +void i915_insert_profiling_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
> +{
> +	struct intel_engine_cs *ring = ringbuf->ring;
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct drm_i915_insert_cmd *entry;
> +
> +	list_for_each_entry(entry, &dev_priv->profile_cmd, list)
> +		entry->insert_cmd(ringbuf, ctx_id);
> +}
> +
> +void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id)
> +{
> +	struct intel_engine_cs *ring = ringbuf->ring;
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct drm_i915_oa_async_node_info *node_info = NULL;
> +	struct drm_i915_oa_async_queue_header *queue_hdr =
> +			(struct drm_i915_oa_async_queue_header *)
> +			dev_priv->oa_pmu.oa_async_buffer.addr;
> +	void *data_ptr = (u8 *)queue_hdr + queue_hdr->data_offset;
> +	int data_size =	(queue_hdr->size_in_bytes - queue_hdr->data_offset);
> +	u32 data_offset, addr = 0;
> +	int ret;
> +
> +	struct drm_i915_oa_async_node *nodes = data_ptr;
> +	int num_nodes = 0;
> +	int index = 0;
> +
> +	/* OA counters are only supported on the render ring */
> +	if (ring->id != RCS)
> +		return;
> +
> +	num_nodes = data_size / sizeof(*nodes);
> +	index = queue_hdr->node_count % num_nodes;
> +
> +	data_offset = offsetof(struct drm_i915_oa_async_node, report_perf);
> +
> +	addr = i915_gem_obj_ggtt_offset(dev_priv->oa_pmu.oa_async_buffer.obj) +
> +		queue_hdr->data_offset +
> +		index * sizeof(struct drm_i915_oa_async_node) +
> +		data_offset;
> +
> +	/* addr should be 64 byte aligned */
> +	BUG_ON(addr & 0x3f);
> +
> +	ret = intel_ring_begin(ring, 4);
> +	if (ret)
> +		return;
> +
> +	intel_ring_emit(ring, MI_REPORT_PERF_COUNT | (1<<0));
> +	intel_ring_emit(ring, addr | MI_REPORT_PERF_COUNT_GGTT);
> +	intel_ring_emit(ring, ring->outstanding_lazy_request->seqno);
> +	intel_ring_emit(ring, MI_NOOP);
> +	intel_ring_advance(ring);
> +
> +	node_info = &nodes[index].node_info;
> +	i915_gem_request_assign(&node_info->req,
> +				ring->outstanding_lazy_request);
> +
> +	node_info->pid = current->pid;
> +	node_info->ctx_id = ctx_id;
> +	queue_hdr->node_count++;
> +	if (queue_hdr->node_count > num_nodes)
> +		queue_hdr->wrap_count++;
> +}
> +
>  static void init_oa_async_buf_queue(struct drm_i915_private *dev_priv)
>  {
>  	struct drm_i915_oa_async_queue_header *hdr =
> @@ -865,6 +935,7 @@ void i915_oa_async_stop_work_fn(struct work_struct *__work)
>  		container_of(__work, typeof(*dev_priv),
>  			oa_pmu.work_event_stop);
>  	struct perf_event *event = dev_priv->oa_pmu.exclusive_event;
> +	struct drm_i915_insert_cmd *entry, *next;
>  	struct drm_i915_oa_async_queue_header *hdr =
>  		(struct drm_i915_oa_async_queue_header *)
>  		dev_priv->oa_pmu.oa_async_buffer.addr;
> @@ -882,6 +953,13 @@ void i915_oa_async_stop_work_fn(struct work_struct *__work)
>  	if (ret)
>  		return;
>  
> +	list_for_each_entry_safe(entry, next, &dev_priv->profile_cmd, list) {
> +		if (entry->insert_cmd == i915_oa_insert_cmd) {
> +			list_del(&entry->list);
> +			kfree(entry);
> +		}
> +	}
> +
>  	dev_priv->oa_pmu.event_active = false;
>  
>  	i915_oa_async_wait_gpu(dev_priv);
> @@ -920,8 +998,14 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
>  	struct drm_i915_private *dev_priv =
>  		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
>  	unsigned long lock_flags;
> +	struct drm_i915_insert_cmd *entry;
>  	u32 oastatus1, tail;
>  
> +	entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
> +	if (!entry)
> +		return;
> +	entry->insert_cmd = i915_oa_insert_cmd;
> +
>  	if (dev_priv->oa_pmu.metrics_set == I915_OA_METRICS_SET_3D) {
>  		config_oa_regs(dev_priv, i915_oa_3d_mux_config_hsw,
>  				i915_oa_3d_mux_config_hsw_len);
> @@ -976,6 +1060,8 @@ static void i915_oa_event_start(struct perf_event *event, int flags)
>  	dev_priv->oa_pmu.event_active = true;
>  	update_oacontrol(dev_priv);
>  
> +	list_add_tail(&entry->list, &dev_priv->profile_cmd);
> +
>  	/* Reset the head ptr to ensure we don't forward reports relating
>  	 * to a previous perf event */
>  	oastatus1 = I915_READ(GEN7_OASTATUS1);
> -- 
> 1.8.5.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters
  2015-06-22  9:50 ` [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters sourab.gupta
@ 2015-06-22 15:59   ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2015-06-22 15:59 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:20:13PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> The perf event framework supports periodic capture of OA counter snapshots. The
> raw OA reports generated by HW are forwarded to userspace using perf apis. This
> patch looks to extend the perf pmu introduced earlier to support the capture of
> asynchronous OA snapshots (in addition to periodic snapshots). These
> asynchronous snapshots will be forwarded by the perf pmu to userspace alongside
> periodic snapshots in the same perf ringbuffer.
> This patch introduces fields for enabling the asynchronous capture mode of OA
> counter snapshots.
> 
> There may be usecases wherein we need more than the periodic OA capture mode
> currently supported by perf_event. We may need to insert commands into the
> ring to dump OA counter snapshots at asynchronous points during workload
> execution.
> This mode is primarily used for two usecases:
>     - Ability to capture system wide metrics. The reports captured should be
>       able to be mapped back to individual contexts.
>     - Ability to inject tags for work, into the reports. This provides
>       visibility into the multiple stages of work within single context.
> 
> The asynchronous reports generated in this way (using MI_REPORT_PERF_COUNT
> commands) will be forwarded to userspace after appending a footer carrying
> this metadata information, which enables the usecases mentioned above.
> 
> This may also be seen as a way to overcome a limitation of Haswell, which
> doesn't write out a context ID with reports and handling this in the kernel
> makes sense when we plan for compatibility with Broadwell which doesn't include
> context id in reports.
> 
> This patch introduces an additional field in the oa attr structure to support
> this type of capture. The data thus captured needs to be stored in a separate
> buffer, distinct from the one used for the periodic OA capture mode. Unlike
> that buffer, this buffer's address does not need to be programmed into the OA
> unit registers such as OASTATUS1, OASTATUS2 and OABUFFER.
> 
> The subsequent patches in the series introduce the data structures and mechanism
> for forwarding reports to userspace, and mechanism for inserting corresponding
> commands into the ringbuffer.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h     |  11 ++++
>  drivers/gpu/drm/i915/i915_oa_perf.c | 122 ++++++++++++++++++++++++++++++------
>  include/uapi/drm/i915_drm.h         |   4 +-
>  3 files changed, 118 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index baa0234..ee4a5d3 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1933,6 +1933,7 @@ struct drm_i915_private {
>  		u32 period_exponent;
>  
>  		u32 metrics_set;
> +		bool async_sample_mode;
>  
>  		struct {
>  			struct drm_i915_gem_object *obj;
> @@ -1944,6 +1945,16 @@ struct drm_i915_private {
>  			int format_size;
>  			spinlock_t flush_lock;
>  		} oa_buffer;
> +
> +		/* Fields for asynchronous OA snapshot capture */
> +		struct {
> +			struct drm_i915_gem_object *obj;
> +			u8 *addr;
> +			u32 head;
> +			u32 tail;
> +			int format;
> +			int format_size;
> +		} oa_async_buffer;
>  	} oa_pmu;
>  #endif
>  
> diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
> index e7e0b2b..419b6a5 100644
> --- a/drivers/gpu/drm/i915/i915_oa_perf.c
> +++ b/drivers/gpu/drm/i915/i915_oa_perf.c
> @@ -166,6 +166,20 @@ static void flush_oa_snapshots(struct drm_i915_private *dev_priv,
>  }
>  
>  static void
> +oa_async_buffer_destroy(struct drm_i915_private *i915)
> +{
> +	mutex_lock(&i915->dev->struct_mutex);
> +
> +	vunmap(i915->oa_pmu.oa_async_buffer.addr);
> +	i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_async_buffer.obj);
> +	drm_gem_object_unreference(&i915->oa_pmu.oa_async_buffer.obj->base);
> +
> +	i915->oa_pmu.oa_async_buffer.obj = NULL;
> +	i915->oa_pmu.oa_async_buffer.addr = NULL;

Please don't reuse dev->struct_mutex to protect OA state; instead create a new
lock. Yes, you need dev->struct_mutex here for handling the gem buffers, but it
should only be taken right around the relevant function calls.

dev->struct_mutex is _really_ complex and all over the place, and is therefore
one of the largest pieces of technical debt we have in i915. Any piece of code
that extends the coverage of dev->struct_mutex is therefore pretty much
guaranteed to get rejected.
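
A narrower locking pattern along those lines might look roughly like the
sketch below (illustrative only; the dedicated oa_pmu lock name is an
assumption, not from the series):

static void
oa_async_buffer_destroy(struct drm_i915_private *i915)
{
	struct drm_i915_gem_object *obj;
	u8 *addr;

	/* OA-private state guarded by its own lock (name assumed) */
	spin_lock(&i915->oa_pmu.lock);
	obj = i915->oa_pmu.oa_async_buffer.obj;
	addr = i915->oa_pmu.oa_async_buffer.addr;
	i915->oa_pmu.oa_async_buffer.obj = NULL;
	i915->oa_pmu.oa_async_buffer.addr = NULL;
	spin_unlock(&i915->oa_pmu.lock);

	vunmap(addr);

	/* struct_mutex taken only around the GEM calls that need it */
	mutex_lock(&i915->dev->struct_mutex);
	i915_gem_object_ggtt_unpin(obj);
	drm_gem_object_unreference(&obj->base);
	mutex_unlock(&i915->dev->struct_mutex);
}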
-Daniel

> +	mutex_unlock(&i915->dev->struct_mutex);
> +}
> +
> +static void
>  oa_buffer_destroy(struct drm_i915_private *i915)
>  {
>  	mutex_lock(&i915->dev->struct_mutex);
> @@ -207,6 +221,9 @@ static void i915_oa_event_destroy(struct perf_event *event)
>  	I915_WRITE(GDT_CHICKEN_BITS, (I915_READ(GDT_CHICKEN_BITS) &
>  				      ~GT_NOA_ENABLE));
>  
> +	if (dev_priv->oa_pmu.async_sample_mode)
> +		oa_async_buffer_destroy(dev_priv);
> +
>  	oa_buffer_destroy(dev_priv);
>  
>  	BUG_ON(dev_priv->oa_pmu.exclusive_event != event);
> @@ -247,21 +264,11 @@ finish:
>  	return addr;
>  }
>  
> -static int init_oa_buffer(struct perf_event *event)
> +static int alloc_oa_obj(struct drm_i915_private *dev_priv,
> +				struct drm_i915_gem_object **obj)
>  {
> -	struct drm_i915_private *dev_priv =
> -		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
>  	struct drm_i915_gem_object *bo;
> -	int ret;
> -
> -	BUG_ON(!IS_HASWELL(dev_priv->dev));
> -	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
> -
> -	ret = i915_mutex_lock_interruptible(dev_priv->dev);
> -	if (ret)
> -		return ret;
> -
> -	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
> +	int ret = 0;
>  
>  	/* NB: We over allocate the OA buffer due to the way raw sample data
>  	 * gets copied from the gpu mapped circular buffer into the perf
> @@ -277,13 +284,13 @@ static int init_oa_buffer(struct perf_event *event)
>  	 * when a report is at the end of the gpu mapped buffer we need to
>  	 * read 4 bytes past the end of the buffer.
>  	 */
> +	intel_runtime_pm_get(dev_priv);
>  	bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE + PAGE_SIZE);
>  	if (bo == NULL) {
>  		DRM_ERROR("Failed to allocate OA buffer\n");
>  		ret = -ENOMEM;
> -		goto unlock;
> +		goto out;
>  	}
> -	dev_priv->oa_pmu.oa_buffer.obj = bo;
>  
>  	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
>  	if (ret)
> @@ -294,6 +301,38 @@ static int init_oa_buffer(struct perf_event *event)
>  	if (ret)
>  		goto err_unref;
>  
> +	*obj = bo;
> +	goto out;
> +
> +err_unref:
> +	drm_gem_object_unreference_unlocked(&bo->base);
> +out:
> +	intel_runtime_pm_put(dev_priv);
> +	return ret;
> +}
> +
> +static int init_oa_buffer(struct perf_event *event)
> +{
> +	struct drm_i915_private *dev_priv =
> +		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
> +	struct drm_i915_gem_object *bo;
> +	int ret;
> +
> +	BUG_ON(!IS_HASWELL(dev_priv->dev));
> +	BUG_ON(dev_priv->oa_pmu.oa_buffer.obj);
> +
> +	ret = i915_mutex_lock_interruptible(dev_priv->dev);
> +	if (ret)
> +		return ret;
> +
> +	spin_lock_init(&dev_priv->oa_pmu.oa_buffer.flush_lock);
> +
> +	ret = alloc_oa_obj(dev_priv, &bo);
> +	if (ret)
> +		goto unlock;
> +
> +	dev_priv->oa_pmu.oa_buffer.obj = bo;
> +
>  	dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
>  	dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo);
>  
> @@ -309,10 +348,35 @@ static int init_oa_buffer(struct perf_event *event)
>  			 dev_priv->oa_pmu.oa_buffer.gtt_offset,
>  			 dev_priv->oa_pmu.oa_buffer.addr);
>  
> -	goto unlock;
> +unlock:
> +	mutex_unlock(&dev_priv->dev->struct_mutex);
> +	return ret;
> +}
>  
> -err_unref:
> -	drm_gem_object_unreference(&bo->base);
> +static int init_async_oa_buffer(struct perf_event *event)
> +{
> +	struct drm_i915_private *dev_priv =
> +		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
> +	struct drm_i915_gem_object *bo;
> +	int ret;
> +
> +	BUG_ON(!IS_HASWELL(dev_priv->dev));
> +	BUG_ON(dev_priv->oa_pmu.oa_async_buffer.obj);
> +
> +	ret = i915_mutex_lock_interruptible(dev_priv->dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = alloc_oa_obj(dev_priv, &bo);
> +	if (ret)
> +		goto unlock;
> +
> +	dev_priv->oa_pmu.oa_async_buffer.obj = bo;
> +
> +	dev_priv->oa_pmu.oa_async_buffer.addr = vmap_oa_buffer(bo);
> +
> +	DRM_DEBUG_DRIVER("OA Async Buffer initialized, vaddr = %p",
> +			 dev_priv->oa_pmu.oa_async_buffer.addr);
>  
>  unlock:
>  	mutex_unlock(&dev_priv->dev->struct_mutex);
> @@ -444,6 +508,9 @@ static int i915_oa_event_init(struct perf_event *event)
>  
>  	report_format = oa_attr.format;
>  	dev_priv->oa_pmu.oa_buffer.format = report_format;
> +	if (oa_attr.batch_buffer_sample)
> +		dev_priv->oa_pmu.oa_async_buffer.format = report_format;
> +
>  	dev_priv->oa_pmu.metrics_set = oa_attr.metrics_set;
>  
>  	if (IS_HASWELL(dev_priv->dev)) {
> @@ -457,6 +524,9 @@ static int i915_oa_event_init(struct perf_event *event)
>  			return -EINVAL;
>  
>  		dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
> +		if (oa_attr.batch_buffer_sample)
> +			dev_priv->oa_pmu.oa_async_buffer.format_size =
> +				snapshot_size;
>  
>  		if (oa_attr.metrics_set > I915_OA_METRICS_SET_MAX)
>  			return -EINVAL;
> @@ -465,6 +535,16 @@ static int i915_oa_event_init(struct perf_event *event)
>  		return -ENODEV;
>  	}
>  
> +	/*
> +	 * In case of per batch buffer sampling, we need to check for
> +	 * CAP_SYS_ADMIN capability as we profile all the running contexts
> +	 */
> +	if (oa_attr.batch_buffer_sample) {
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EACCES;
> +		dev_priv->oa_pmu.async_sample_mode = true;
> +	}
> +
>  	/* Since we are limited to an exponential scale for
>  	 * programming the OA sampling period we don't allow userspace
>  	 * to pass a precise attr.sample_period. */
> @@ -528,6 +608,12 @@ static int i915_oa_event_init(struct perf_event *event)
>  	if (ret)
>  		return ret;
>  
> +	if (oa_attr.batch_buffer_sample) {
> +		ret = init_async_oa_buffer(event);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	BUG_ON(dev_priv->oa_pmu.exclusive_event);
>  	dev_priv->oa_pmu.exclusive_event = event;
>  
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 992e1e9..354dc3a 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -92,7 +92,9 @@ typedef struct _drm_i915_oa_attr {
>  	__u32 ctx_id;
>  
>  	__u64 single_context : 1,
> -	      __reserved_1 : 63;
> +	__reserved_1 : 31;
> +	__u32 batch_buffer_sample:1,
> +	__reserved_2:31;
>  } drm_i915_oa_attr_t;
>  
>  /* Header for PERF_RECORD_DEVICE type events */
> -- 
> 1.8.5.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC 3/8] drm/i915: Add the data structures for async OA capture mode
  2015-06-22  9:50 ` [RFC 3/8] drm/i915: Add the data structures for async OA capture mode sourab.gupta
@ 2015-06-22 16:01   ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2015-06-22 16:01 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:20:14PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch introduces the data structures for capturing asynchronous OA
> snapshots.
> 
> The captured data is organized into nodes. Each node has a field for the OA
> report along with metadata such as ctx_id, pid, etc. The metadata can be
> extended to provide any additional information. The data is laid out with a
> queue header at the beginning, which holds information about the size, data
> offset, number of nodes captured, etc.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>

Please don't add data structures without code; it essentially makes this
patch unreviewable without looking at other patches, which just increases
the review burden for no gain.

Instead of creating a big new structure, only add the new fields you're
using in each patch, and by doing so slowly build up the entire thing.

Thanks, Daniel

> ---
>  drivers/gpu/drm/i915/i915_drv.h | 21 +++++++++++++++++++++
>  include/uapi/drm/i915_drm.h     |  5 +++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index ee4a5d3..da150bc 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1677,6 +1677,27 @@ extern const struct i915_oa_reg i915_oa_sampler_balance_mux_config_hsw[];
>  extern const int i915_oa_sampler_balance_mux_config_hsw_len;
>  extern const struct i915_oa_reg i915_oa_sampler_balance_b_counter_config_hsw[];
>  extern const int i915_oa_sampler_balance_b_counter_config_hsw_len;
> +
> +
> +struct drm_i915_oa_async_queue_header {
> +	__u64 size_in_bytes;
> +	/* Byte offset, start of queue header to first node */
> +	__u64 data_offset;
> +	__u32 node_count;
> +	__u32 wrap_count;
> +	__u32 pad[10];
> +};
> +
> +struct drm_i915_oa_async_node_info {
> +	__u32 pid;
> +	__u32 ctx_id;
> +	__u32 pad[14];
> +};
> +
> +struct drm_i915_oa_async_node {
> +	struct drm_i915_oa_async_node_info node_info;
> +	__u32 report_perf[64]; /* Must be aligned to 64-byte boundary */
> +};
>  #endif
>  
>  struct drm_i915_private {
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 354dc3a..c91b427 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -124,6 +124,11 @@ enum drm_i915_oa_event_type {
>  	I915_OA_RECORD_MAX,			/* non-ABI */
>  };
>  
> +struct drm_i915_oa_async_node_footer {
> +	__u32 pid;
> +	__u32 ctx_id;
> +};
> +
>  /* Each region is a minimum of 16k, and there are at most 255 of them.
>   */
>  #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
> -- 
> 1.8.5.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

end of thread, other threads:[~2015-06-22 15:58 UTC | newest]

Thread overview: 12+ messages
2015-06-22  9:50 [RFC 0/8] Introduce framework to forward asynchronous OA counter sourab.gupta
2015-06-22  9:50 ` [RFC 1/8] drm/i915: Have globally unique context ids, as opposed to drm file specific sourab.gupta
2015-06-22  9:50 ` [RFC 2/8] drm/i915: Introduce mode for asynchronous capture of OA counters sourab.gupta
2015-06-22 15:59   ` Daniel Vetter
2015-06-22  9:50 ` [RFC 3/8] drm/i915: Add the data structures for async OA capture mode sourab.gupta
2015-06-22 16:01   ` Daniel Vetter
2015-06-22  9:50 ` [RFC 4/8] drm/i915: Add mechanism for forwarding async OA counter snapshots through perf sourab.gupta
2015-06-22  9:50 ` [RFC 5/8] drm/i915: Wait for GPU to finish before event stop, in async OA counter mode sourab.gupta
2015-06-22  9:50 ` [RFC 6/8] drm/i915: Routines for inserting OA capture commands in the ringbuffer sourab.gupta
2015-06-22 15:55   ` Daniel Vetter
2015-06-22  9:50 ` [RFC 7/8] drm/i915: Add commands in ringbuf for OA snapshot capture across Batchbuffer boundaries sourab.gupta
2015-06-22  9:50 ` [RFC 8/8] drm/i915: Add perfTag support for OA counter reports sourab.gupta
