* [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf
@ 2016-06-02  5:18 sourab.gupta
  2016-06-02  5:18 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
                   ` (15 more replies)
  0 siblings, 16 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx
  Cc: Christopher S . Hall, Daniel Vetter, Sourab Gupta, Deepak S,
	Thomas Gleixner

From: Sourab Gupta <sourab.gupta@intel.com>

This series adds a framework for collecting gpu performance metrics
associated with the command stream of a particular engine. These metrics
include OA reports, timestamps, mmio metrics, etc., and are collected
around batchbuffer boundaries.

This work utilizes the underlying infrastructure introduced in Robert Bragg's
patches for collecting periodic OA counter snapshots (based on Haswell):
https://lists.freedesktop.org/archives/intel-gfx/2016-April/093206.html

This patch set is based on Gen8+ version of Robert's patches which can be found
here: https://github.com/rib/linux/tree/wip/rib/oa-2016-05-05-nightly
Those patches have not yet been floated individually on the mailing list;
I hope this doesn't cause any significant loss of clarity when reviewing
the work proposed in this patch series.

Compared to the last series I floated earlier,
(https://lists.freedesktop.org/archives/intel-gfx/2016-April/093645.html),
this series incorporates the following changes/fixes, besides rebasing
on Robert's latest work (on a later nightly):

* Based on Chris's suggestion, I have tried experimenting with using the cross
  timestamp framework for the purpose of retrieving tightly coupled
  device/system timestamps. In our case, this framework gives us correlated
  pairs of gpu+system time which can be used over a period of time to correct
  the frequency of the timestamp clock, and thus enables us to accurately
  report system time (_MONO_RAW) to userspace as requested. The results are
  generally observed to be quite a bit better with the use of cross timestamps,
  and the frequency delta gradually tapers down to 0 with increasing correction
  periods.
  The use of the cross timestamp framework, though, requires us to have a
  cyclecounter/timecounter abstraction for the timestamp clocksource, and
  further requires a few changes in the kernel timekeeping/clocksource code.
  I am looking for feedback on the use of this framework and the changes
  involved.

These patches can be found for viewing at
https://github.com/sourabgu/linux/tree/oa-2016-05-05

Sourab Gupta (15):
  drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  drm/i915: Expose OA sample source to userspace
  drm/i915: Framework for capturing command stream based OA reports
  drm/i915: flush periodic samples, in case of no pending CS sample
    requests
  drm/i915: Handle the overflow condition for command stream buf
  drm/i915: Populate ctx ID for periodic OA reports
  drm/i915: Add support for having pid output with OA report
  drm/i915: Add support for emitting execbuffer tags through OA counter
    reports
  drm/i915: Extend i915 perf framework for collecting timestamps on all
    gpu engines
  drm/i915: Extract raw GPU timestamps from OA reports to forward in
    perf samples
  drm/i915: Support opening multiple concurrent perf streams
  time: Expose current clocksource in use by timekeeping framework
  time: export clocks_calc_mult_shift
  drm/i915: Mechanism to forward clock monotonic raw time in perf
    samples
  drm/i915: Support for capturing MMIO register values

 drivers/gpu/drm/i915/i915_dma.c            |    2 +
 drivers/gpu/drm/i915/i915_drv.h            |  117 +-
 drivers/gpu/drm/i915/i915_gem_context.c    |    3 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    5 +
 drivers/gpu/drm/i915/i915_perf.c           | 1925 +++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_reg.h            |    6 +
 drivers/gpu/drm/i915/intel_lrc.c           |    4 +
 include/linux/timekeeping.h                |    5 +
 include/uapi/drm/i915_drm.h                |   79 ++
 kernel/time/clocksource.c                  |    1 +
 kernel/time/timekeeping.c                  |   12 +
 11 files changed, 1958 insertions(+), 201 deletions(-)

-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-07-27  9:18   ` Deepak
  2016-06-02  5:18 ` [PATCH 02/15] drm/i915: Expose OA sample source to userspace sourab.gupta
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds a new ctx getparam ioctl parameter, which can be used by
userspace to retrieve the ctx unique id.

Userspace can use this to map i915 perf samples to their particular
contexts, since the samples carry the ctx unique ids. Otherwise userspace
has no way of maintaining this association, since it only knows the
per-drm-file ctx handles.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
 include/uapi/drm/i915_drm.h             | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e974451..09f5178 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1001,6 +1001,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 		else
 			args->value = to_i915(dev)->ggtt.base.total;
 		break;
+	case I915_CONTEXT_PARAM_HW_ID:
+		args->value = ctx->hw_id;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4a1bcfd8..0badc16 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1171,6 +1171,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_BAN_PERIOD	0x1
 #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
+#define I915_CONTEXT_PARAM_HW_ID	0x4
 	__u64 value;
 };
 
-- 
1.9.1



* [PATCH 02/15] drm/i915: Expose OA sample source to userspace
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
  2016-06-02  5:18 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch exposes a new sample source field to userspace. This field can
be populated to specify the origin of the OA report.
For example, for internally triggered reports (non MI_RPC reports), the
RPT_ID field has bitfields specifying the origin, such as timer, render ctx
switch, etc.
Likewise, this field can be used to specify the source as MI_RPC when such
support is added.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_perf.c | 55 ++++++++++++++++++++++++++++++++++------
 include/uapi/drm/i915_drm.h      | 16 ++++++++++++
 2 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index bb2e44a..a557a82 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -74,6 +74,13 @@ static u32 i915_perf_stream_paranoid = true;
  */
 #define OA_EXPONENT_MAX 31
 
+#define GEN8_OAREPORT_REASON_TIMER          (1<<19)
+#define GEN8_OAREPORT_REASON_TRIGGER1       (1<<20)
+#define GEN8_OAREPORT_REASON_TRIGGER2       (1<<21)
+#define GEN8_OAREPORT_REASON_CTX_SWITCH     (1<<22)
+#define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
+#define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
+
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
 static int zero;
 static int oa_exponent_max = OA_EXPONENT_MAX;
@@ -113,7 +120,8 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 	[I915_OA_FORMAT_C4_B8]		    = { 7, 64 },
 };
 
-#define SAMPLE_OA_REPORT      (1<<0)
+#define SAMPLE_OA_REPORT	(1<<0)
+#define SAMPLE_OA_SOURCE_INFO	(1<<1)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -216,6 +224,27 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 		return -EFAULT;
 	buf += sizeof(header);
 
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		enum drm_i915_perf_oa_event_source source;
+
+		if (INTEL_INFO(dev_priv)->gen >= 8) {
+			u32 reason = *(u32 *)report;
+
+			if (reason & GEN8_OAREPORT_REASON_CTX_SWITCH)
+				source =
+				I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH;
+			else if (reason & GEN8_OAREPORT_REASON_TIMER)
+				source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
+			else
+				source = I915_PERF_OA_EVENT_SOURCE_UNDEFINED;
+		} else
+			source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
+
+		if (copy_to_user(buf, &source, 4))
+			return -EFAULT;
+		buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, report, report_size))
 			return -EFAULT;
@@ -1170,11 +1199,6 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	int format_size;
 	int ret;
 
-	if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
-		DRM_ERROR("Only OA report sampling supported\n");
-		return -EINVAL;
-	}
-
 	if (!dev_priv->perf.oa.ops.init_oa_buffer) {
 		DRM_ERROR("OA unit not supported\n");
 		return -ENODEV;
@@ -1203,8 +1227,20 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 
 	format_size = dev_priv->perf.oa.oa_formats[props->oa_format].size;
 
-	stream->sample_flags |= SAMPLE_OA_REPORT;
-	stream->sample_size += format_size;
+	if (props->sample_flags & SAMPLE_OA_REPORT) {
+		stream->sample_flags |= SAMPLE_OA_REPORT;
+		stream->sample_size += format_size;
+	}
+
+	if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
+			DRM_ERROR(
+			"OA source type can't be sampled without OA report");
+			return -EINVAL;
+		}
+		stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+		stream->sample_size += 4;
+	}
 
 	dev_priv->perf.oa.oa_buffer.format_size = format_size;
 	BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
@@ -1842,6 +1878,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 			props->oa_periodic = true;
 			props->oa_period_exponent = value;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE:
+			props->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 0badc16..c4e8dad 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1192,6 +1192,14 @@ enum drm_i915_oa_format {
 	I915_OA_FORMAT_MAX	    /* non-ABI */
 };
 
+enum drm_i915_perf_oa_event_source {
+	I915_PERF_OA_EVENT_SOURCE_UNDEFINED,
+	I915_PERF_OA_EVENT_SOURCE_PERIODIC,
+	I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH,
+
+	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
+};
+
 enum drm_i915_perf_property_id {
 	/**
 	 * Open the stream for a specific context handle (as used with
@@ -1226,6 +1234,13 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_OA_EXPONENT,
 
+	/**
+	 * The value of this property set to 1 requests inclusion of sample
+	 * source field to be given to userspace. The sample source field
+	 * specifies the origin of OA report.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1290,6 +1305,7 @@ enum drm_i915_perf_record_type {
 	 * struct {
 	 *     struct drm_i915_perf_record_header header;
 	 *
+	 *     { u32 source_info; } && DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1



* [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
  2016-06-02  5:18 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
  2016-06-02  5:18 ` [PATCH 02/15] drm/i915: Expose OA sample source to userspace sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  6:00   ` Martin Peres
  2016-06-02  5:18 ` [PATCH 04/15] drm/i915: flush periodic samples, in case of no pending CS sample requests sourab.gupta
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces a framework to enable OA counter reports associated
with the Render command stream. We can then associate the reports captured
through this mechanism with their corresponding context ids. This can be
further extended to associate any other metadata with the corresponding
samples (since the association with the Render command stream gives us the
ability to capture this information while inserting the corresponding
capture commands into the command stream).

The OA reports generated in this way are associated with a corresponding
workload, and thus can be used to delimit the workload (i.e. sample the
counters at the workload boundaries) within an ongoing stream of periodic
counter snapshots.

There may be use cases where we need more than the periodic OA capture
mode that is currently supported. This mode is primarily aimed at two use
cases:
    - Ability to capture system-wide metrics, along with the ability to map
      the reports back to individual contexts (particularly for HSW).
    - Ability to inject tags for work into the reports. This provides
      visibility into the multiple stages of work within a single context.

Userspace will be able to distinguish between the periodic and CS based
OA reports by virtue of the source_info sample field.

The MI_REPORT_PERF_COUNT command can be used to capture snapshots of OA
counters, and is inserted at BB boundaries.
The data thus captured is stored in a separate buffer, distinct from the
buffer otherwise used for the periodic OA capture mode.
The metadata pertaining to each snapshot is maintained in a list, which
also holds, per captured snapshot, the offset into the gem buffer object.
In order to track whether the gpu has completed processing a node, a field
referencing the corresponding gem request is added, and that request is
tracked for completion of the command.

Both periodic and RCS based reports are associated with a single stream
(corresponding to the render engine), and the samples are expected to be in
sequential order according to their timestamps. Since these reports are
collected in separate buffers, they are merge sorted at the time of
forwarding to userspace during the read call.

v2: Aligning with the non-perf interface (custom drm ioctl based). Also, a
few related patches are squashed together for better readability

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 drivers/gpu/drm/i915/i915_drv.h            |  44 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   4 +
 drivers/gpu/drm/i915/i915_perf.c           | 871 ++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/intel_lrc.c           |   4 +
 include/uapi/drm/i915_drm.h                |  15 +
 5 files changed, 804 insertions(+), 134 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 82622c4..f95b02b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1791,6 +1791,18 @@ struct i915_perf_stream_ops {
 	 * The stream will always be disabled before this is called.
 	 */
 	void (*destroy)(struct i915_perf_stream *stream);
+
+	/*
+	 * Routine to emit the commands in the command streamer associated
+	 * with the corresponding gpu engine.
+	 */
+	void (*command_stream_hook)(struct drm_i915_gem_request *req);
+};
+
+enum i915_perf_stream_state {
+	I915_PERF_STREAM_DISABLED,
+	I915_PERF_STREAM_ENABLE_IN_PROGRESS,
+	I915_PERF_STREAM_ENABLED,
 };
 
 struct i915_perf_stream {
@@ -1798,11 +1810,15 @@ struct i915_perf_stream {
 
 	struct list_head link;
 
+	enum intel_engine_id engine;
 	u32 sample_flags;
 	int sample_size;
 
 	struct intel_context *ctx;
-	bool enabled;
+	enum i915_perf_stream_state state;
+
+	/* Whether command stream based data collection is enabled */
+	bool cs_mode;
 
 	const struct i915_perf_stream_ops *ops;
 };
@@ -1818,10 +1834,21 @@ struct i915_oa_ops {
 					u32 ctx_id);
 	void (*legacy_ctx_switch_unlocked)(struct drm_i915_gem_request *req);
 	int (*read)(struct i915_perf_stream *stream,
-		    struct i915_perf_read_state *read_state);
+		    struct i915_perf_read_state *read_state, u32 ts);
 	bool (*oa_buffer_is_empty)(struct drm_i915_private *dev_priv);
 };
 
+/*
+ * List element to hold info about the perf sample data associated
+ * with a particular GPU command stream.
+ */
+struct i915_perf_cs_data_node {
+	struct list_head link;
+	struct drm_i915_gem_request *request;
+	u32 offset;
+	u32 ctx_id;
+};
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
@@ -2107,6 +2134,8 @@ struct drm_i915_private {
 		struct ctl_table_header *sysctl_header;
 
 		struct mutex lock;
+
+		struct mutex streams_lock;
 		struct list_head streams;
 
 		spinlock_t hook_lock;
@@ -2151,6 +2180,16 @@ struct drm_i915_private {
 			const struct i915_oa_format *oa_formats;
 			int n_builtin_sets;
 		} oa;
+
+		/* Command stream based perf data buffer */
+		struct {
+			struct drm_i915_gem_object *obj;
+			struct i915_vma *vma;
+			u8 *addr;
+		} command_stream_buf;
+
+		struct list_head node_list;
+		spinlock_t node_list_lock;
 	} perf;
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
@@ -3513,6 +3552,7 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req);
 void i915_oa_update_reg_state(struct intel_engine_cs *engine,
 			      struct intel_context *ctx,
 			      uint32_t *reg_state);
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req);
 
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index e0ee5d1..8b759af 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1305,12 +1305,16 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	if (exec_len == 0)
 		exec_len = params->batch_obj->base.size;
 
+	i915_perf_command_stream_hook(params->request);
+
 	ret = engine->dispatch_execbuffer(params->request,
 					exec_start, exec_len,
 					params->dispatch_flags);
 	if (ret)
 		return ret;
 
+	i915_perf_command_stream_hook(params->request);
+
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index a557a82..42e930f 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -81,6 +81,13 @@ static u32 i915_perf_stream_paranoid = true;
 #define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
 #define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
 
+/* Data common to periodic and RCS based samples */
+struct oa_sample_data {
+	u32 source;
+	u32 ctx_id;
+	const u8 *report;
+};
+
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
 static int zero;
 static int oa_exponent_max = OA_EXPONENT_MAX;
@@ -120,8 +127,19 @@ static struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = {
 	[I915_OA_FORMAT_C4_B8]		    = { 7, 64 },
 };
 
+/* Duplicated from similar static enum in i915_gem_execbuffer.c */
+#define I915_USER_RINGS (4)
+static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
+	[I915_EXEC_DEFAULT]	= RCS,
+	[I915_EXEC_RENDER]	= RCS,
+	[I915_EXEC_BLT]		= BCS,
+	[I915_EXEC_BSD]		= VCS,
+	[I915_EXEC_VEBOX]	= VECS
+};
+
 #define SAMPLE_OA_REPORT	(1<<0)
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
+#define SAMPLE_CTX_ID		(1<<2)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -134,8 +152,231 @@ struct perf_open_properties {
 	int oa_format;
 	bool oa_periodic;
 	int oa_period_exponent;
+
+	/* Command stream mode */
+	bool cs_mode;
+	enum intel_engine_id engine;
 };
 
+/*
+ * Emit the commands to capture metrics, into the command stream. This function
+ * can be called concurrently with the stream operations and doesn't require
+ * perf mutex lock.
+ */
+
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *request)
+{
+	struct intel_engine_cs *engine = request->engine;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	struct i915_perf_stream *stream;
+
+	if (!dev_priv->perf.initialized)
+		return;
+
+	mutex_lock(&dev_priv->perf.streams_lock);
+	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
+		if ((stream->state == I915_PERF_STREAM_ENABLED) &&
+					stream->cs_mode)
+			stream->ops->command_stream_hook(request);
+	}
+	mutex_unlock(&dev_priv->perf.streams_lock);
+}
+
+/*
+ * Release some perf entries to make space for a new entry data. We dereference
+ * the associated request before deleting the entry. Also, no need to check for
+ * gpu completion of commands, since, these entries are anyways going to be
+ * replaced by a new entry, and gpu will overwrite the buffer contents
+ * eventually, when the request associated with new entry completes.
+ */
+static void release_some_perf_entries(struct drm_i915_private *dev_priv,
+					u32 target_size)
+{
+	struct i915_perf_cs_data_node *entry, *next;
+	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u32 size = 0;
+
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->perf.node_list, link) {
+
+		size += entry_size;
+		i915_gem_request_unreference(entry->request);
+		list_del(&entry->link);
+		kfree(entry);
+
+		if (size >= target_size)
+			break;
+	}
+}
+
+/*
+ * Insert the perf entry to the end of the list. This function never fails,
+ * since it always manages to insert the entry. If the space is exhausted in
+ * the buffer, it will remove the oldest entries in order to make space.
+ */
+static void insert_perf_entry(struct drm_i915_private *dev_priv,
+				struct i915_perf_cs_data_node *entry)
+{
+	struct i915_perf_cs_data_node *first_entry, *last_entry;
+	int max_offset = dev_priv->perf.command_stream_buf.obj->base.size;
+	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+
+	spin_lock(&dev_priv->perf.node_list_lock);
+	if (list_empty(&dev_priv->perf.node_list)) {
+		entry->offset = 0;
+		list_add_tail(&entry->link, &dev_priv->perf.node_list);
+		spin_unlock(&dev_priv->perf.node_list_lock);
+		return;
+	}
+
+	first_entry = list_first_entry(&dev_priv->perf.node_list,
+				       typeof(*first_entry), link);
+	last_entry = list_last_entry(&dev_priv->perf.node_list,
+				     typeof(*last_entry), link);
+
+	if (last_entry->offset >= first_entry->offset) {
+		/* Sufficient space available at the end of buffer? */
+		if (last_entry->offset + 2*entry_size < max_offset)
+			entry->offset = last_entry->offset + entry_size;
+		/*
+		 * Wraparound condition. Is sufficient space available at
+		 * beginning of buffer?
+		 */
+		else if (entry_size < first_entry->offset)
+			entry->offset = 0;
+		/* Insufficient space. Overwrite existing old entries */
+		else {
+			u32 target_size = entry_size - first_entry->offset;
+
+			release_some_perf_entries(dev_priv, target_size);
+			entry->offset = 0;
+		}
+	} else {
+		/* Sufficient space available? */
+		if (last_entry->offset + 2*entry_size < first_entry->offset)
+			entry->offset = last_entry->offset + entry_size;
+		/* Insufficient space. Overwrite existing old entries */
+		else {
+			u32 target_size = entry_size -
+				(first_entry->offset - last_entry->offset -
+				entry_size);
+
+			release_some_perf_entries(dev_priv, target_size);
+			entry->offset = last_entry->offset + entry_size;
+		}
+	}
+	list_add_tail(&entry->link, &dev_priv->perf.node_list);
+	spin_unlock(&dev_priv->perf.node_list_lock);
+}
+
+static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct intel_context *ctx = req->ctx;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	struct i915_perf_cs_data_node *entry;
+	u32 addr = 0;
+	int ret;
+
+	/* OA counters are only supported on the render engine */
+	BUG_ON(engine->id != RCS);
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (entry == NULL) {
+		DRM_ERROR("alloc failed\n");
+		return;
+	}
+
+	ret = intel_ring_begin(req, 4);
+	if (ret) {
+		kfree(entry);
+		return;
+	}
+
+	entry->ctx_id = ctx->hw_id;
+	i915_gem_request_assign(&entry->request, req);
+
+	insert_perf_entry(dev_priv, entry);
+
+	addr = dev_priv->perf.command_stream_buf.vma->node.start +
+		entry->offset;
+
+	/* addr should be 64 byte aligned */
+	BUG_ON(addr & 0x3f);
+
+	if (i915.enable_execlists) {
+		intel_logical_ring_emit(ringbuf, MI_REPORT_PERF_COUNT | (2<<0));
+		intel_logical_ring_emit(ringbuf,
+					addr | MI_REPORT_PERF_COUNT_GGTT);
+		intel_logical_ring_emit(ringbuf, 0);
+		intel_logical_ring_emit(ringbuf,
+			i915_gem_request_get_seqno(req));
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		if (INTEL_INFO(engine->dev)->gen >= 8) {
+			intel_ring_emit(engine, MI_REPORT_PERF_COUNT | (2<<0));
+			intel_ring_emit(engine, addr | MI_REPORT_PERF_COUNT_GGTT);
+			intel_ring_emit(engine, 0);
+			intel_ring_emit(engine,
+				i915_gem_request_get_seqno(req));
+		} else {
+			intel_ring_emit(engine, MI_REPORT_PERF_COUNT | (1<<0));
+			intel_ring_emit(engine, addr | MI_REPORT_PERF_COUNT_GGTT);
+			intel_ring_emit(engine, i915_gem_request_get_seqno(req));
+			intel_ring_emit(engine, MI_NOOP);
+		}
+		intel_ring_advance(engine);
+	}
+	i915_vma_move_to_active(dev_priv->perf.command_stream_buf.vma, req);
+}
+
+static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
+{
+	struct i915_perf_cs_data_node *last_entry = NULL;
+	struct drm_i915_gem_request *req = NULL;
+	int ret;
+
+	/*
+	 * Wait for the last scheduled request to complete. This would
+	 * implicitly wait for the prior submitted requests. The refcount
+	 * of the requests is not decremented here.
+	 */
+	spin_lock(&dev_priv->perf.node_list_lock);
+
+	if (!list_empty(&dev_priv->perf.node_list)) {
+		last_entry = list_last_entry(&dev_priv->perf.node_list,
+			struct i915_perf_cs_data_node, link);
+		req = last_entry->request;
+	}
+	spin_unlock(&dev_priv->perf.node_list_lock);
+
+	if (!req)
+		return 0;
+
+	ret = __i915_wait_request(req, true, NULL, NULL);
+	if (ret) {
+		DRM_ERROR("Failed to wait for request\n");
+		return ret;
+	}
+	return 0;
+}
+
+static void i915_oa_rcs_free_requests(struct drm_i915_private *dev_priv)
+{
+	struct i915_perf_cs_data_node *entry, *next;
+
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->perf.node_list, link) {
+		i915_gem_request_unreference(entry->request);
+
+		spin_lock(&dev_priv->perf.node_list_lock);
+		list_del(&entry->link);
+		spin_unlock(&dev_priv->perf.node_list_lock);
+		kfree(entry);
+	}
+}
+
 /* NB: This is either called via fops or the poll check hrtimer (atomic ctx)
  *
  * It's safe to read OA config state here unlocked, assuming that this is only
@@ -205,7 +446,7 @@ static int append_oa_status(struct i915_perf_stream *stream,
  */
 static int append_oa_sample(struct i915_perf_stream *stream,
 			    struct i915_perf_read_state *read_state,
-			    const u8 *report)
+			    struct oa_sample_data *data)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -225,6 +466,38 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 	buf += sizeof(header);
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
+		if (copy_to_user(buf, &data->source, 4))
+			return -EFAULT;
+		buf += 4;
+	}
+
+	if (sample_flags & SAMPLE_CTX_ID) {
+		if (copy_to_user(buf, &data->ctx_id, 4))
+			return -EFAULT;
+		buf += 4;
+	}
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		if (copy_to_user(buf, data->report, report_size))
+			return -EFAULT;
+		buf += report_size;
+	}
+
+	read_state->buf = buf;
+	read_state->read += header.size;
+
+	return 0;
+}
+
+static int append_oa_buffer_sample(struct i915_perf_stream *stream,
+				    struct i915_perf_read_state *read_state,
+				    const u8 *report)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	u32 sample_flags = stream->sample_flags;
+	struct oa_sample_data data = { 0 };
+
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
 
 		if (INTEL_INFO(dev_priv)->gen >= 8) {
@@ -240,20 +513,16 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 		} else
 			source = I915_PERF_OA_EVENT_SOURCE_PERIODIC;
 
-		if (copy_to_user(buf, &source, 4))
-			return -EFAULT;
-		buf += 4;
-	}
-
-	if (sample_flags & SAMPLE_OA_REPORT) {
-		if (copy_to_user(buf, report, report_size))
-			return -EFAULT;
+		data.source = source;
 	}
+#warning "FIXME: append_oa_buffer_sample: read ctx ID from report and map that to an intel_context::global_id"
+	if (sample_flags & SAMPLE_CTX_ID)
+		data.ctx_id = 0;
 
-	read_state->buf += header.size;
-	read_state->read += header.size;
+	if (sample_flags & SAMPLE_OA_REPORT)
+		data.report = report;
 
-	return 0;
+	return append_oa_sample(stream, read_state, &data);
 }
 
 /**
@@ -273,7 +542,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 *head_ptr,
-				  u32 tail)
+				  u32 tail, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -283,7 +552,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 	u32 taken;
 	int ret = 0;
 
-	BUG_ON(!stream->enabled);
+	BUG_ON(stream->state != I915_PERF_STREAM_ENABLED);
 
 	head = *head_ptr - dev_priv->perf.oa.oa_buffer.gtt_offset;
 	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
@@ -313,6 +582,11 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 		u8 *report = oa_buf_base + head;
 		u32 *report32 = (void *)report;
 		u32 ctx_id = report32[2];
+		u32 report_ts = report32[1];
+
+		/* Report timestamp should not exceed passed in ts */
+		if (report_ts > ts)
+			break;
 
 		/* All the report sizes factor neatly into the buffer
 		 * size so we never expect to see a report split
@@ -364,7 +638,8 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 			    dev_priv->perf.oa.specific_ctx_id != ctx_id)
 				report32[2] = 0x1fffff;
 
-			ret = append_oa_sample(stream, read_state, report);
+			ret = append_oa_buffer_sample(stream, read_state,
+							report);
 			if (ret)
 				break;
 
@@ -386,7 +661,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
  * updated @read_state.
  */
 static int gen8_oa_read(struct i915_perf_stream *stream,
-			struct i915_perf_read_state *read_state)
+			struct i915_perf_read_state *read_state, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -428,7 +703,7 @@ static int gen8_oa_read(struct i915_perf_stream *stream,
 
 	/* If there is still buffer space */
 
-	ret = gen8_append_oa_reports(stream, read_state, &head, tail);
+	ret = gen8_append_oa_reports(stream, read_state, &head, tail, ts);
 
 	/* All the report sizes are a power of two and the
 	 * head should always be incremented by some multiple
@@ -467,7 +742,7 @@ static int gen8_oa_read(struct i915_perf_stream *stream,
 static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 *head_ptr,
-				  u32 tail)
+				  u32 tail, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -478,7 +753,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 	u32 taken;
 	int ret = 0;
 
-	BUG_ON(!stream->enabled);
+	BUG_ON(stream->state != I915_PERF_STREAM_ENABLED);
 
 	head = *head_ptr - dev_priv->perf.oa.oa_buffer.gtt_offset;
 	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
@@ -542,7 +817,11 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 			continue;
 		}
 
-		ret = append_oa_sample(stream, read_state, report);
+		/* Report timestamp should not exceed the passed-in ts */
+		if (report32[1] > ts)
+			break;
+
+		ret = append_oa_buffer_sample(stream, read_state, report);
 		if (ret)
 			break;
 
@@ -569,7 +848,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
  * updated @read_state.
  */
 static int gen7_oa_read(struct i915_perf_stream *stream,
-			struct i915_perf_read_state *read_state)
+			struct i915_perf_read_state *read_state, u32 ts)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -641,7 +920,7 @@ static int gen7_oa_read(struct i915_perf_stream *stream,
 			GEN7_OASTATUS1_REPORT_LOST;
 	}
 
-	ret = gen7_append_oa_reports(stream, read_state, &head, tail);
+	ret = gen7_append_oa_reports(stream, read_state, &head, tail, ts);
 
 	/* All the report sizes are a power of two and the
 	 * head should always be incremented by some multiple
@@ -665,20 +944,131 @@ static int gen7_oa_read(struct i915_perf_stream *stream,
 	return ret;
 }
 
-static bool i915_oa_can_read(struct i915_perf_stream *stream)
+/**
+ * Copies a command stream OA report into the userspace read() buffer, after
+ * forwarding any periodic OA reports with timestamps lower than the CS one.
+ *
+ * NB: some data may be successfully copied to the userspace buffer
+ * even if an error is returned, and this is reflected in the
+ * updated @read_state.
+ */
+static int append_oa_rcs_sample(struct i915_perf_stream *stream,
+				 struct i915_perf_read_state *read_state,
+				 struct i915_perf_cs_data_node *node)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct oa_sample_data data = { 0 };
+	const u8 *report = dev_priv->perf.command_stream_buf.addr +
+				node->offset;
+	u32 sample_flags = stream->sample_flags;
+	u32 report_ts;
+	int ret;
 
-	return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
+	/* First, append the periodic OA samples having lower timestamps */
+	report_ts = *(u32 *)(report + 4);
+	ret = dev_priv->perf.oa.ops.read(stream, read_state, report_ts);
+	if (ret)
+		return ret;
+
+	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
+		data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
+
+	if (sample_flags & SAMPLE_CTX_ID)
+		data.ctx_id = node->ctx_id;
+
+	if (sample_flags & SAMPLE_OA_REPORT)
+		data.report = report;
+
+	return append_oa_sample(stream, read_state, &data);
 }
 
-static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+/**
+ * Copies all command stream based OA reports into the userspace read() buffer.
+ *
+ * NB: some data may be successfully copied to the userspace buffer
+ * even if an error is returned, and this is reflected in the
+ * updated @read_state.
+ */
+static int oa_rcs_append_reports(struct i915_perf_stream *stream,
+				  struct i915_perf_read_state *read_state)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct i915_perf_cs_data_node *entry, *next;
+	LIST_HEAD(free_list);
+	int ret = 0;
 
-	/* We would wait indefinitly if periodic sampling is not enabled */
-	if (!dev_priv->perf.oa.periodic)
-		return -EIO;
+	spin_lock(&dev_priv->perf.node_list_lock);
+	if (list_empty(&dev_priv->perf.node_list)) {
+		spin_unlock(&dev_priv->perf.node_list_lock);
+		return 0;
+	}
+	list_for_each_entry_safe(entry, next,
+				 &dev_priv->perf.node_list, link) {
+		if (!i915_gem_request_completed(entry->request, true))
+			break;
+		list_move_tail(&entry->link, &free_list);
+	}
+	spin_unlock(&dev_priv->perf.node_list_lock);
+
+	if (list_empty(&free_list))
+		return 0;
+
+	list_for_each_entry_safe(entry, next, &free_list, link) {
+		ret = append_oa_rcs_sample(stream, read_state, entry);
+		if (ret)
+			break;
+
+		list_del(&entry->link);
+		i915_gem_request_unreference(entry->request);
+		kfree(entry);
+	}
+
+	/* Don't discard remaining entries, keep them for next read */
+	spin_lock(&dev_priv->perf.node_list_lock);
+	list_splice(&free_list, &dev_priv->perf.node_list);
+	spin_unlock(&dev_priv->perf.node_list_lock);
+
+	return ret;
+}
+
+/*
+ * Checks whether the command stream buffer associated with the stream has
+ * data ready to be forwarded to userspace.
+ * Returns true if there is nothing ready to forward (no node queued, or the
+ * oldest request has not yet completed), else returns false.
+ */
+static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
+
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct i915_perf_cs_data_node *entry = NULL;
+	struct drm_i915_gem_request *request = NULL;
+
+	spin_lock(&dev_priv->perf.node_list_lock);
+	entry = list_first_entry_or_null(&dev_priv->perf.node_list,
+			struct i915_perf_cs_data_node, link);
+	if (entry)
+		request = entry->request;
+	spin_unlock(&dev_priv->perf.node_list_lock);
+
+	if (!entry)
+		return true;
+	else if (!i915_gem_request_completed(request, true))
+		return true;
+	else
+		return false;
+}
+
+/*
+ * Checks whether the stream has data ready to forward to userspace.
+ * For command stream based streams, check if the command stream buffer has
+ * at least one sample ready; if not, return false irrespective of whether
+ * the periodic OA buffer has data or not.
+ */
+
+static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
 
 	/* Note: the oa_buffer_is_empty() condition is ok to run unlocked as it
 	 * just performs mmio reads of the OA buffer head + tail pointers and
@@ -686,8 +1076,35 @@ static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
 	 * can't be destroyed until completion (such as a read()) that ensures
 	 * the device + OA buffer can't disappear
 	 */
+	if (stream->cs_mode)
+		return !command_stream_buf_is_empty(stream);
+	else
+		return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
+}
+
+static bool i915_oa_can_read(struct i915_perf_stream *stream)
+{
+
+	return stream_have_data__unlocked(stream);
+}
+
+static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int ret;
+
+	/* We would wait indefinitely if periodic sampling is not enabled */
+	if (!dev_priv->perf.oa.periodic)
+		return -EIO;
+
+	if (stream->cs_mode) {
+		ret = i915_oa_rcs_wait_gpu(dev_priv);
+		if (ret)
+			return ret;
+	}
+
 	return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
-					!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv));
+					stream_have_data__unlocked(stream));
 }
 
 static void i915_oa_poll_wait(struct i915_perf_stream *stream,
@@ -704,7 +1121,27 @@ static int i915_oa_read(struct i915_perf_stream *stream,
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	return dev_priv->perf.oa.ops.read(stream, read_state);
+	if (stream->cs_mode)
+		return oa_rcs_append_reports(stream, read_state);
+	else
+		return dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
+}
+
+static void
+free_command_stream_buf(struct drm_i915_private *dev_priv)
+{
+	mutex_lock(&dev_priv->dev->struct_mutex);
+
+	i915_gem_object_unpin_map(dev_priv->perf.command_stream_buf.obj);
+	i915_gem_object_ggtt_unpin(dev_priv->perf.command_stream_buf.obj);
+	drm_gem_object_unreference(
+			&dev_priv->perf.command_stream_buf.obj->base);
+
+	dev_priv->perf.command_stream_buf.obj = NULL;
+	dev_priv->perf.command_stream_buf.vma = NULL;
+	dev_priv->perf.command_stream_buf.addr = NULL;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
 static void
@@ -729,12 +1166,17 @@ static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 
 	BUG_ON(stream != dev_priv->perf.oa.exclusive_stream);
 
-	dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
+	if (stream->cs_mode)
+		free_command_stream_buf(dev_priv);
 
-	free_oa_buffer(dev_priv);
+	if (dev_priv->perf.oa.oa_buffer.obj) {
+		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
 
-	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
-	intel_runtime_pm_put(dev_priv);
+		free_oa_buffer(dev_priv);
+
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+		intel_runtime_pm_put(dev_priv);
+	}
 
 	dev_priv->perf.oa.exclusive_stream = NULL;
 }
@@ -792,16 +1234,17 @@ static void gen8_init_oa_buffer(struct drm_i915_private *dev_priv)
 		    GEN8_OATAILPTR_MASK));
 }
 
-static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
+static int alloc_obj(struct drm_i915_private *dev_priv,
+		     struct drm_i915_gem_object **obj, u8 **addr)
 {
 	struct drm_i915_gem_object *bo;
 	int ret;
 
-	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
+	intel_runtime_pm_get(dev_priv);
 
 	ret = i915_mutex_lock_interruptible(dev_priv->dev);
 	if (ret)
-		return ret;
+		goto out;
 
 	bo = i915_gem_object_create(dev_priv->dev, OA_BUFFER_SIZE);
 	if (IS_ERR(bo)) {
@@ -809,7 +1252,6 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 		ret = PTR_ERR(bo);
 		goto unlock;
 	}
-	dev_priv->perf.oa.oa_buffer.obj = bo;
 
 	ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
 	if (ret)
@@ -820,17 +1262,13 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 	if (ret)
 		goto err_unref;
 
-	dev_priv->perf.oa.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
-	dev_priv->perf.oa.oa_buffer.addr = i915_gem_object_pin_map(bo);
-	if (dev_priv->perf.oa.oa_buffer.addr == NULL)
+	*addr = i915_gem_object_pin_map(bo);
+	if (*addr == NULL) {
+		ret = -ENOMEM;
 		goto err_unpin;
+	}
 
-	dev_priv->perf.oa.ops.init_oa_buffer(dev_priv);
-
-	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
-			 dev_priv->perf.oa.oa_buffer.gtt_offset,
-			 dev_priv->perf.oa.oa_buffer.addr);
-
+	*obj = bo;
 	goto unlock;
 
 err_unpin:
@@ -841,9 +1279,63 @@ err_unref:
 
 unlock:
 	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+out:
+	intel_runtime_pm_put(dev_priv);
 	return ret;
 }
 
+static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_gem_object *bo;
+	u8 *obj_addr;
+	int ret;
+
+	BUG_ON(dev_priv->perf.oa.oa_buffer.obj);
+
+	ret = alloc_obj(dev_priv, &bo, &obj_addr);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.oa.oa_buffer.obj = bo;
+	dev_priv->perf.oa.oa_buffer.addr = obj_addr;
+	dev_priv->perf.oa.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo);
+
+	dev_priv->perf.oa.ops.init_oa_buffer(dev_priv);
+
+	DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p",
+			 dev_priv->perf.oa.oa_buffer.gtt_offset,
+			 dev_priv->perf.oa.oa_buffer.addr);
+	return 0;
+}
+
+static int alloc_command_stream_buf(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_gem_object *bo;
+	u8 *obj_addr;
+	int ret;
+
+	BUG_ON(dev_priv->perf.command_stream_buf.obj);
+
+	ret = alloc_obj(dev_priv, &bo, &obj_addr);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.command_stream_buf.obj = bo;
+	dev_priv->perf.command_stream_buf.addr = obj_addr;
+	dev_priv->perf.command_stream_buf.vma = i915_gem_obj_to_ggtt(bo);
+	if (WARN_ON(!list_empty(&dev_priv->perf.node_list)))
+		INIT_LIST_HEAD(&dev_priv->perf.node_list);
+
+	DRM_DEBUG_DRIVER(
+		"command stream buf initialized, gtt offset = 0x%x, vaddr = %p",
+		 (unsigned int)
+		 dev_priv->perf.command_stream_buf.vma->node.start,
+		 dev_priv->perf.command_stream_buf.addr);
+
+	return 0;
+}
+
 static void config_oa_regs(struct drm_i915_private *dev_priv,
 			   const struct i915_oa_reg *regs,
 			   int n_regs)
@@ -1087,7 +1579,8 @@ static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
 {
 	assert_spin_locked(&dev_priv->perf.hook_lock);
 
-	if (dev_priv->perf.oa.exclusive_stream->enabled) {
+	if (dev_priv->perf.oa.exclusive_stream->state !=
+					I915_PERF_STREAM_DISABLED) {
 		unsigned long ctx_id = 0;
 
 		if (dev_priv->perf.oa.exclusive_stream->ctx)
@@ -1169,10 +1662,15 @@ static void i915_oa_stream_disable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	dev_priv->perf.oa.ops.oa_disable(dev_priv);
-
 	if (dev_priv->perf.oa.periodic)
 		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
+
+	if (stream->cs_mode) {
+		i915_oa_rcs_wait_gpu(dev_priv);
+		i915_oa_rcs_free_requests(dev_priv);
+	}
+
+	dev_priv->perf.oa.ops.oa_disable(dev_priv);
 }
 
 static u64 oa_exponent_to_ns(struct drm_i915_private *dev_priv, int exponent)
@@ -1189,6 +1687,7 @@ static const struct i915_perf_stream_ops i915_oa_stream_ops = {
 	.wait_unlocked = i915_oa_wait_unlocked,
 	.poll_wait = i915_oa_poll_wait,
 	.read = i915_oa_read,
+	.command_stream_hook = i915_perf_command_stream_hook_oa,
 };
 
 static int i915_oa_stream_init(struct i915_perf_stream *stream,
@@ -1196,14 +1695,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			       struct perf_open_properties *props)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
-	int format_size;
+	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
+						      SAMPLE_OA_SOURCE_INFO);
+	bool cs_sample_data = props->sample_flags & SAMPLE_OA_REPORT;
 	int ret;
 
-	if (!dev_priv->perf.oa.ops.init_oa_buffer) {
-		DRM_ERROR("OA unit not supported\n");
-		return -ENODEV;
-	}
-
 	/* To avoid the complexity of having to accurately filter
 	 * counter reports and marshal to the appropriate client
 	 * we currently only allow exclusive access
@@ -1213,97 +1709,166 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 		return -EBUSY;
 	}
 
-	if (!props->metrics_set) {
-		DRM_ERROR("OA metric set not specified\n");
-		return -EINVAL;
-	}
-
-	if (!props->oa_format) {
-		DRM_ERROR("OA report format not specified\n");
-		return -EINVAL;
+	if ((props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
+		if (IS_HASWELL(dev_priv->dev)) {
+			DRM_ERROR(
+				"On HSW, context ID sampling only supported via command stream");
+			return -EINVAL;
+		} else if (!i915.enable_execlists) {
+			DRM_ERROR(
+				"On Gen8+ without execlists, context ID sampling only supported via command stream");
+			return -EINVAL;
+		}
 	}
 
 	stream->sample_size = sizeof(struct drm_i915_perf_record_header);
 
-	format_size = dev_priv->perf.oa.oa_formats[props->oa_format].size;
+	if (require_oa_unit) {
+		int format_size;
 
-	if (props->sample_flags & SAMPLE_OA_REPORT) {
-		stream->sample_flags |= SAMPLE_OA_REPORT;
-		stream->sample_size += format_size;
-	}
+		if (!dev_priv->perf.oa.ops.init_oa_buffer) {
+			DRM_ERROR("OA unit not supported\n");
+			return -ENODEV;
+		}
+
+		if (!props->metrics_set) {
+			DRM_ERROR("OA metric set not specified\n");
+			return -EINVAL;
+		}
+
+		if (!props->oa_format) {
+			DRM_ERROR("OA report format not specified\n");
+			return -EINVAL;
+		}
 
-	if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
-		if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
+		if (props->cs_mode && (props->engine != RCS)) {
 			DRM_ERROR(
-			"OA source type can't be sampled without OA report");
+				"Command stream OA metrics only available via Render CS\n");
 			return -EINVAL;
 		}
-		stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
-		stream->sample_size += 4;
-	}
+		stream->engine = RCS;
+
+		format_size =
+			dev_priv->perf.oa.oa_formats[props->oa_format].size;
+
+		if (props->sample_flags & SAMPLE_OA_REPORT) {
+			stream->sample_flags |= SAMPLE_OA_REPORT;
+			stream->sample_size += format_size;
+		}
+
+		if (props->sample_flags & SAMPLE_OA_SOURCE_INFO) {
+			if (!(props->sample_flags & SAMPLE_OA_REPORT)) {
+				DRM_ERROR(
+					  "OA source type can't be sampled without OA report");
+				return -EINVAL;
+			}
+			stream->sample_flags |= SAMPLE_OA_SOURCE_INFO;
+			stream->sample_size += 4;
+		}
+
+		dev_priv->perf.oa.oa_buffer.format_size = format_size;
+		BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
+
+		dev_priv->perf.oa.oa_buffer.format =
+			dev_priv->perf.oa.oa_formats[props->oa_format].format;
 
-	dev_priv->perf.oa.oa_buffer.format_size = format_size;
-	BUG_ON(dev_priv->perf.oa.oa_buffer.format_size == 0);
+		dev_priv->perf.oa.metrics_set = props->metrics_set;
 
-	dev_priv->perf.oa.oa_buffer.format =
-		dev_priv->perf.oa.oa_formats[props->oa_format].format;
+		dev_priv->perf.oa.periodic = props->oa_periodic;
+		if (dev_priv->perf.oa.periodic) {
+			u64 period_ns = oa_exponent_to_ns(dev_priv,
+						props->oa_period_exponent);
 
-	dev_priv->perf.oa.metrics_set = props->metrics_set;
+			dev_priv->perf.oa.period_exponent =
+						props->oa_period_exponent;
 
-	dev_priv->perf.oa.periodic = props->oa_periodic;
-	if (dev_priv->perf.oa.periodic) {
-		u64 period_ns = oa_exponent_to_ns(dev_priv,
-						  props->oa_period_exponent);
+			/* See comment for OA_TAIL_MARGIN_NSEC for details
+			 * about this tail_margin...
+			 */
+			dev_priv->perf.oa.tail_margin =
+				((OA_TAIL_MARGIN_NSEC / period_ns) + 1) *
+				format_size;
+		}
+
+		if (i915.enable_execlists && stream->ctx)
+			dev_priv->perf.oa.specific_ctx_id = stream->ctx->hw_id;
 
-		dev_priv->perf.oa.period_exponent = props->oa_period_exponent;
+		ret = alloc_oa_buffer(dev_priv);
+		if (ret)
+			return ret;
 
-		/* See comment for OA_TAIL_MARGIN_NSEC for details
-		 * about this tail_margin...
+		/* PRM - observability performance counters:
+		 *
+		 *   OACONTROL, performance counter enable, note:
+		 *
+		 *   "When this bit is set, in order to have coherent counts,
+		 *   RC6 power state and trunk clock gating must be disabled.
+		 *   This can be achieved by programming MMIO registers as
+		 *   0xA094=0 and 0xA090[31]=1"
+		 *
+		 *   In our case we are expecting that taking pm + FORCEWAKE
+		 *   references will effectively disable RC6.
 		 */
-		dev_priv->perf.oa.tail_margin =
-			((OA_TAIL_MARGIN_NSEC / period_ns) + 1) * format_size;
+		intel_runtime_pm_get(dev_priv);
+		intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+		ret = dev_priv->perf.oa.ops.enable_metric_set(dev_priv);
+		if (ret) {
+			intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+			intel_runtime_pm_put(dev_priv);
+			free_oa_buffer(dev_priv);
+			return ret;
+		}
+
+		/* On Haswell we have to track which OASTATUS1 flags we've
+		 * already seen since they can't be cleared while periodic
+		 * sampling is enabled. */
+		dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
 	}
 
-	if (i915.enable_execlists && stream->ctx)
-		dev_priv->perf.oa.specific_ctx_id = stream->ctx->hw_id;
+	if (props->sample_flags & SAMPLE_CTX_ID) {
+		stream->sample_flags |= SAMPLE_CTX_ID;
+		stream->sample_size += 4;
+	}
 
-	ret = alloc_oa_buffer(dev_priv);
-	if (ret)
-		return ret;
+	if (props->cs_mode) {
+		if (!cs_sample_data) {
+			DRM_ERROR(
+				"Ring given without requesting any CS data to sample");
+			ret = -EINVAL;
+			goto cs_error;
+		}
 
-	/* PRM - observability performance counters:
-	 *
-	 *   OACONTROL, performance counter enable, note:
-	 *
-	 *   "When this bit is set, in order to have coherent counts,
-	 *   RC6 power state and trunk clock gating must be disabled.
-	 *   This can be achieved by programming MMIO registers as
-	 *   0xA094=0 and 0xA090[31]=1"
-	 *
-	 *   In our case we are expecting that taking pm + FORCEWAKE
-	 *   references will effectively disable RC6.
-	 */
-	intel_runtime_pm_get(dev_priv);
-	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+		if (!(props->sample_flags & SAMPLE_CTX_ID)) {
+			DRM_ERROR(
+				"Ring given without requesting any CS specific property");
+			ret = -EINVAL;
+			goto cs_error;
+		}
 
-	ret = dev_priv->perf.oa.ops.enable_metric_set(dev_priv);
-	if (ret) {
-		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
-		intel_runtime_pm_put(dev_priv);
-		free_oa_buffer(dev_priv);
-		return ret;
+		stream->cs_mode = true;
+
+		ret = alloc_command_stream_buf(dev_priv);
+		if (ret)
+			goto cs_error;
 	}
 
 	stream->ops = &i915_oa_stream_ops;
 
-	/* On Haswell we have to track which OASTATUS1 flags we've already
-	 * seen since they can't be cleared while periodic sampling is enabled.
-	 */
-	dev_priv->perf.oa.gen7_latched_oastatus1 = 0;
-
 	dev_priv->perf.oa.exclusive_stream = stream;
 
 	return 0;
+
+cs_error:
+	if (require_oa_unit) {
+		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
+
+		intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+		intel_runtime_pm_put(dev_priv);
+
+		free_oa_buffer(dev_priv);
+	}
+	return ret;
 }
 
 static void gen7_update_hw_ctx_id_locked(struct drm_i915_private *dev_priv,
@@ -1395,7 +1960,8 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req)
 		return;
 
 	if (dev_priv->perf.oa.exclusive_stream &&
-	    dev_priv->perf.oa.exclusive_stream->enabled) {
+	    dev_priv->perf.oa.exclusive_stream->state !=
+				I915_PERF_STREAM_DISABLED) {
 
 		/* XXX: We don't take a lock here and this may run
 		 * async with respect to stream methods. Notably we
@@ -1505,7 +2071,7 @@ static ssize_t i915_perf_read(struct file *file,
 	 * disabled stream as an error. In particular it might otherwise lead
 	 * to a deadlock for blocking file descriptors...
 	 */
-	if (!stream->enabled)
+	if (stream->state == I915_PERF_STREAM_DISABLED)
 		return -EIO;
 
 	if (!(file->f_flags & O_NONBLOCK)) {
@@ -1537,12 +2103,21 @@ static ssize_t i915_perf_read(struct file *file,
 
 static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
 {
+	struct i915_perf_stream *stream;
+
 	struct drm_i915_private *dev_priv =
 		container_of(hrtimer, typeof(*dev_priv),
 			     perf.oa.poll_check_timer);
 
-	if (!dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv))
-		wake_up(&dev_priv->perf.oa.poll_wq);
+	/* No need to protect the streams list here, since the hrtimer is
+	 * disabled before the stream is removed from list, and currently a
+	 * single exclusive_stream is supported.
+	 * XXX: revisit this when multiple concurrent streams are supported.
+	 */
+	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
+		if (stream_have_data__unlocked(stream))
+			wake_up(&dev_priv->perf.oa.poll_wq);
+	}
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
 
@@ -1578,23 +2153,23 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
 
 static void i915_perf_enable_locked(struct i915_perf_stream *stream)
 {
-	if (stream->enabled)
+	if (stream->state != I915_PERF_STREAM_DISABLED)
 		return;
 
-	/* Allow stream->ops->enable() to refer to this */
-	stream->enabled = true;
+	stream->state = I915_PERF_STREAM_ENABLE_IN_PROGRESS;
 
 	if (stream->ops->enable)
 		stream->ops->enable(stream);
+
+	stream->state = I915_PERF_STREAM_ENABLED;
 }
 
 static void i915_perf_disable_locked(struct i915_perf_stream *stream)
 {
-	if (!stream->enabled)
+	if (stream->state != I915_PERF_STREAM_ENABLED)
 		return;
 
-	/* Allow stream->ops->disable() to refer to this */
-	stream->enabled = false;
+	stream->state = I915_PERF_STREAM_DISABLED;
 
 	if (stream->ops->disable)
 		stream->ops->disable(stream);
@@ -1635,14 +2210,16 @@ static void i915_perf_destroy_locked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	if (stream->enabled)
+	mutex_lock(&dev_priv->perf.streams_lock);
+	list_del(&stream->link);
+	mutex_unlock(&dev_priv->perf.streams_lock);
+
+	if (stream->state == I915_PERF_STREAM_ENABLED)
 		i915_perf_disable_locked(stream);
 
 	if (stream->ops->destroy)
 		stream->ops->destroy(stream);
 
-	list_del(&stream->link);
-
 	if (stream->ctx) {
 		mutex_lock(&dev_priv->dev->struct_mutex);
 		i915_gem_context_unreference(stream->ctx);
@@ -1753,7 +2330,9 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 	 */
 	BUG_ON(stream->sample_flags != props->sample_flags);
 
+	mutex_lock(&dev_priv->perf.streams_lock);
 	list_add(&stream->link, &dev_priv->perf.streams);
+	mutex_unlock(&dev_priv->perf.streams_lock);
 
 	if (param->flags & I915_PERF_FLAG_FD_CLOEXEC)
 		f_flags |= O_CLOEXEC;
@@ -1772,7 +2351,9 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 	return stream_fd;
 
 err_open:
+	mutex_lock(&dev_priv->perf.streams_lock);
 	list_del(&stream->link);
+	mutex_unlock(&dev_priv->perf.streams_lock);
 	if (stream->ops->destroy)
 		stream->ops->destroy(stream);
 err_alloc:
@@ -1881,6 +2462,29 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE:
 			props->sample_flags |= SAMPLE_OA_SOURCE_INFO;
 			break;
+		case DRM_I915_PERF_PROP_ENGINE: {
+				unsigned int user_ring_id =
+					value & I915_EXEC_RING_MASK;
+				enum intel_engine_id engine;
+
+				if (user_ring_id > I915_USER_RINGS)
+					return -EINVAL;
+
+				/* XXX: Currently only RCS is supported.
+				 * Remove this check when support for other
+				 * engines is added
+				 */
+				engine = user_ring_map[user_ring_id];
+				if (engine != RCS)
+					return -EINVAL;
+
+				props->cs_mode = true;
+				props->engine = engine;
+			}
+			break;
+		case DRM_I915_PERF_PROP_SAMPLE_CTX_ID:
+			props->sample_flags |= SAMPLE_CTX_ID;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -1988,8 +2592,11 @@ void i915_perf_init(struct drm_device *dev)
 	init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
 
 	INIT_LIST_HEAD(&dev_priv->perf.streams);
+	INIT_LIST_HEAD(&dev_priv->perf.node_list);
 	mutex_init(&dev_priv->perf.lock);
+	mutex_init(&dev_priv->perf.streams_lock);
 	spin_lock_init(&dev_priv->perf.hook_lock);
+	spin_lock_init(&dev_priv->perf.node_list_lock);
 
 	if (IS_HASWELL(dev)) {
 		dev_priv->perf.oa.ops.init_oa_buffer = gen7_init_oa_buffer;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7b09299..2751f3a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -869,10 +869,14 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     args->batch_start_offset;
 
+	i915_perf_command_stream_hook(params->request);
+
 	ret = engine->emit_bb_start(params->request, exec_start, params->dispatch_flags);
 	if (ret)
 		return ret;
 
+	i915_perf_command_stream_hook(params->request);
+
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index c4e8dad..7faced8 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1196,6 +1196,7 @@ enum drm_i915_perf_oa_event_source {
 	I915_PERF_OA_EVENT_SOURCE_UNDEFINED,
 	I915_PERF_OA_EVENT_SOURCE_PERIODIC,
 	I915_PERF_OA_EVENT_SOURCE_CONTEXT_SWITCH,
+	I915_PERF_OA_EVENT_SOURCE_RCS,
 
 	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
 };
@@ -1241,6 +1242,19 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE,
 
+	/**
+	 * The value of this property specifies the GPU engine for which
+	 * the samples need to be collected. Specifying this property also
+	 * implies the command stream based sample collection.
+	 */
+	DRM_I915_PERF_PROP_ENGINE,
+
+	/**
+	 * The value of this property set to 1 requests inclusion of context ID
+	 * in the perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_CTX_ID,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1306,6 +1320,7 @@ enum drm_i915_perf_record_type {
 	 *     struct drm_i915_perf_record_header header;
 	 *
 	 *     { u32 source_info; } && DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE
+	 *     { u32 ctx_id; } && DRM_I915_PERF_PROP_SAMPLE_CTX_ID
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH 04/15] drm/i915: flush periodic samples, in case of no pending CS sample requests
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (2 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 05/15] drm/i915: Handle the overflow condition for command stream buf sourab.gupta
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

When there are no pending CS OA samples, flush the periodic OA samples
collected so far.

Periodic OA samples can be safely forwarded when there are no pending
CS samples, since no future CS sample can then have a timestamp earlier
than the current periodic timestamp. When CS samples are pending,
however, the eventual ordering between those pending CS samples and the
periodic samples is not yet known, so the periodic samples have to be
held back.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  14 ++--
 drivers/gpu/drm/i915/i915_perf.c | 173 +++++++++++++++++++++++++++++----------
 2 files changed, 140 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f95b02b..7efdfc2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1751,7 +1751,7 @@ struct i915_perf_stream_ops {
 	/* Return: true if any i915 perf records are ready to read()
 	 * for this stream.
 	 */
-	bool (*can_read)(struct i915_perf_stream *stream);
+	bool (*can_read_unlocked)(struct i915_perf_stream *stream);
 
 	/* Call poll_wait, passing a wait queue that will be woken
 	 * once there is something ready to read() for the stream
@@ -1763,8 +1763,8 @@ struct i915_perf_stream_ops {
 	/* For handling a blocking read, wait until there is something
 	 * to ready to read() for the stream. E.g. wait on the same
 	 * wait queue that would be passed to poll_wait() until
-	 * ->can_read() returns true (if its safe to call ->can_read()
-	 * without the i915 perf lock held).
+	 * ->can_read_unlocked() returns true (if it's safe to call
+	 * ->can_read_unlocked() without the i915 perf lock held).
 	 */
 	int (*wait_unlocked)(struct i915_perf_stream *stream);
 
@@ -1834,8 +1834,10 @@ struct i915_oa_ops {
 					u32 ctx_id);
 	void (*legacy_ctx_switch_unlocked)(struct drm_i915_gem_request *req);
 	int (*read)(struct i915_perf_stream *stream,
-		    struct i915_perf_read_state *read_state, u32 ts);
-	bool (*oa_buffer_is_empty)(struct drm_i915_private *dev_priv);
+		    struct i915_perf_read_state *read_state,
+		    u32 ts, u32 max_records);
+	int (*oa_buffer_num_samples)(struct drm_i915_private *dev_priv,
+					u32 *last_ts);
 };
 
 /*
@@ -2175,6 +2177,8 @@ struct drm_i915_private {
 			u32 gen7_latched_oastatus1;
 			u32 ctx_oactxctrl_off;
 			u32 ctx_flexeu0_off;
+			u32 n_pending_periodic_samples;
+			u32 pending_periodic_ts;
 
 			struct i915_oa_ops ops;
 			const struct i915_oa_format *oa_formats;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 42e930f..b53ccf5 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -388,13 +388,30 @@ static void i915_oa_rcs_free_requests(struct drm_i915_private *dev_priv)
  * pointers.  A race here could result in a false positive !empty status which
  * is acceptable.
  */
-static bool gen8_oa_buffer_is_empty_fop_unlocked(struct drm_i915_private *dev_priv)
+static int
+gen8_oa_buffer_num_samples_fop_unlocked(struct drm_i915_private *dev_priv,
+					u32 *last_ts)
 {
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
-	u32 head = I915_READ(GEN8_OAHEADPTR);
-	u32 tail = I915_READ(GEN8_OATAILPTR);
+	u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.addr;
+	u32 head = I915_READ(GEN8_OAHEADPTR) & GEN8_OAHEADPTR_MASK;
+	u32 tail = I915_READ(GEN8_OATAILPTR) & GEN8_OATAILPTR_MASK;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	u32 num_samples;
+	u8 *report;
+
+	head -= dev_priv->perf.oa.oa_buffer.gtt_offset;
+	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
+	num_samples = OA_TAKEN(tail, head) / report_size;
 
-	return OA_TAKEN(tail, head) < report_size;
+	/* read the timestamp of the last sample */
+	if (num_samples) {
+		head += report_size*(num_samples - 1);
+		report = oa_buf_base + (head & mask);
+		*last_ts = *(u32 *)(report + 4);
+	}
+
+	return num_samples;
 }
 
 /* NB: This is either called via fops or the poll check hrtimer (atomic ctx)
@@ -408,16 +425,32 @@ static bool gen8_oa_buffer_is_empty_fop_unlocked(struct drm_i915_private *dev_pr
  * pointers.  A race here could result in a false positive !empty status which
  * is acceptable.
  */
-static bool gen7_oa_buffer_is_empty_fop_unlocked(struct drm_i915_private *dev_priv)
+static int
+gen7_oa_buffer_num_samples_fop_unlocked(struct drm_i915_private *dev_priv,
+					u32 *last_ts)
 {
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
 	u32 oastatus2 = I915_READ(GEN7_OASTATUS2);
 	u32 oastatus1 = I915_READ(GEN7_OASTATUS1);
 	u32 head = oastatus2 & GEN7_OASTATUS2_HEAD_MASK;
 	u32 tail = oastatus1 & GEN7_OASTATUS1_TAIL_MASK;
+	u8 *oa_buf_base = dev_priv->perf.oa.oa_buffer.addr;
+	u32 mask = (OA_BUFFER_SIZE - 1);
+	int available_size;
+	u32 num_samples = 0;
+	u8 *report;
 
-	return OA_TAKEN(tail, head) <
-		dev_priv->perf.oa.tail_margin + report_size;
+	head -= dev_priv->perf.oa.oa_buffer.gtt_offset;
+	tail -= dev_priv->perf.oa.oa_buffer.gtt_offset;
+	available_size = OA_TAKEN(tail, head) - dev_priv->perf.oa.tail_margin;
+	if (available_size >= report_size) {
+		num_samples = available_size / report_size;
+		head += report_size*(num_samples - 1);
+		report = oa_buf_base + (head & mask);
+		*last_ts = *(u32 *)(report + 4);
+	}
+
+	return num_samples;
 }
 
 /**
@@ -542,7 +575,7 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 *head_ptr,
-				  u32 tail, u32 ts)
+				  u32 tail, u32 ts, u32 max_records)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -551,6 +584,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 	u32 head;
 	u32 taken;
 	int ret = 0;
+	int n_records = 0;
 
 	BUG_ON(stream->state != I915_PERF_STREAM_ENABLED);
 
@@ -577,7 +611,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 	tail &= ~(report_size - 1);
 
 	for (/* none */;
-	     (taken = OA_TAKEN(tail, head));
+	     (taken = OA_TAKEN(tail, head)) && (n_records < max_records);
 	     head = (head + report_size) & mask) {
 		u8 *report = oa_buf_base + head;
 		u32 *report32 = (void *)report;
@@ -643,6 +677,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 			if (ret)
 				break;
 
+			n_records++;
 			dev_priv->perf.oa.oa_buffer.last_ctx_id = ctx_id;
 		}
 	}
@@ -661,7 +696,8 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
  * updated @read_state.
  */
 static int gen8_oa_read(struct i915_perf_stream *stream,
-			struct i915_perf_read_state *read_state, u32 ts)
+			struct i915_perf_read_state *read_state,
+			u32 ts, u32 max_records)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -703,7 +739,8 @@ static int gen8_oa_read(struct i915_perf_stream *stream,
 
 	/* If there is still buffer space */
 
-	ret = gen8_append_oa_reports(stream, read_state, &head, tail, ts);
+	ret = gen8_append_oa_reports(stream, read_state, &head, tail,
+				     ts, max_records);
 
 	/* All the report sizes are a power of two and the
 	 * head should always be incremented by some multiple
@@ -742,7 +779,7 @@ static int gen8_oa_read(struct i915_perf_stream *stream,
 static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state,
 				  u32 *head_ptr,
-				  u32 tail, u32 ts)
+				  u32 tail, u32 ts, u32 max_records)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -752,6 +789,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 	u32 head;
 	u32 taken;
 	int ret = 0;
+	int n_records = 0;
 
 	BUG_ON(stream->state != I915_PERF_STREAM_ENABLED);
 
@@ -791,7 +829,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 	tail &= mask;
 
 	for (/* none */;
-	     (taken = OA_TAKEN(tail, head));
+	     (taken = OA_TAKEN(tail, head)) && (n_records < max_records);
 	     head = (head + report_size) & mask) {
 		u8 *report = oa_buf_base + head;
 		u32 *report32 = (void *)report;
@@ -824,6 +862,7 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
 		ret = append_oa_buffer_sample(stream, read_state, report);
 		if (ret)
 			break;
+		n_records++;
 
 		/* The above report-id field sanity check is based on
 		 * the assumption that the OA buffer is initially
@@ -848,7 +887,8 @@ static int gen7_append_oa_reports(struct i915_perf_stream *stream,
  * updated @read_state.
  */
 static int gen7_oa_read(struct i915_perf_stream *stream,
-			struct i915_perf_read_state *read_state, u32 ts)
+			struct i915_perf_read_state *read_state,
+			u32 ts, u32 max_records)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -920,7 +960,8 @@ static int gen7_oa_read(struct i915_perf_stream *stream,
 			GEN7_OASTATUS1_REPORT_LOST;
 	}
 
-	ret = gen7_append_oa_reports(stream, read_state, &head, tail, ts);
+	ret = gen7_append_oa_reports(stream, read_state, &head, tail,
+				ts, max_records);
 
 	/* All the report sizes are a power of two and the
 	 * head should always be incremented by some multiple
@@ -966,7 +1007,8 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 
 	/* First, append the periodic OA samples having lower timestamps */
 	report_ts = *(u32 *)(report + 4);
-	ret = dev_priv->perf.oa.ops.read(stream, read_state, report_ts);
+	ret = dev_priv->perf.oa.ops.read(stream, read_state,
+					report_ts, U32_MAX);
 	if (ret)
 		return ret;
 
@@ -983,7 +1025,8 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 }
 
 /**
- * Copies all command stream based OA reports into userspace read() buffer.
+ * Copies all OA reports into userspace read() buffer. This includes command
+ * stream as well as periodic OA reports.
  *
  * NB: some data may be successfully copied to the userspace buffer
  * even if an error is returned, and this is reflected in the
@@ -1000,7 +1043,7 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 	spin_lock(&dev_priv->perf.node_list_lock);
 	if (list_empty(&dev_priv->perf.node_list)) {
 		spin_unlock(&dev_priv->perf.node_list_lock);
-		return 0;
+		goto pending_periodic;
 	}
 	list_for_each_entry_safe(entry, next,
 				 &dev_priv->perf.node_list, link) {
@@ -1011,7 +1054,7 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 	spin_unlock(&dev_priv->perf.node_list_lock);
 
 	if (list_empty(&free_list))
-		return 0;
+		goto pending_periodic;
 
 	list_for_each_entry_safe(entry, next, &free_list, link) {
 		ret = append_oa_rcs_sample(stream, read_state, entry);
@@ -1029,16 +1072,35 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 	spin_unlock(&dev_priv->perf.node_list_lock);
 
 	return ret;
+
+pending_periodic:
+	if (!dev_priv->perf.oa.n_pending_periodic_samples)
+		return 0;
+
+	ret = dev_priv->perf.oa.ops.read(stream, read_state,
+				dev_priv->perf.oa.pending_periodic_ts,
+				dev_priv->perf.oa.n_pending_periodic_samples);
+	dev_priv->perf.oa.n_pending_periodic_samples = 0;
+	dev_priv->perf.oa.pending_periodic_ts = 0;
+	return ret;
 }
 
+enum cs_buf_data_state {
+	CS_BUF_EMPTY,
+	CS_BUF_REQ_PENDING,
+	CS_BUF_HAVE_DATA,
+};
+
 /*
  * Checks whether the command stream buffer associated with the stream has
  * data ready to be forwarded to userspace.
- * Returns true if atleast one request associated with command stream is
- * completed, else returns false.
+ * Value returned:
+ * CS_BUF_HAVE_DATA	- there is at least one completed request
+ * CS_BUF_REQ_PENDING	- there are requests pending, but no completed requests
+ * CS_BUF_EMPTY		- no requests scheduled
  */
-static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
-
+static enum cs_buf_data_state command_stream_buf_state(
+				struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	struct i915_perf_cs_data_node *entry = NULL;
@@ -1052,37 +1114,63 @@ static bool command_stream_buf_is_empty(struct i915_perf_stream *stream)
 	spin_unlock(&dev_priv->perf.node_list_lock);
 
 	if (!entry)
-		return true;
+		return CS_BUF_EMPTY;
 	else if (!i915_gem_request_completed(request, true))
-		return true;
+		return CS_BUF_REQ_PENDING;
 	else
-		return false;
+		return CS_BUF_HAVE_DATA;
 }
 
 /*
- * Checks whether the stream has data ready to forward to userspace.
- * For command stream based streams, check if the command stream buffer has
- * atleast one sample ready, if not return false, irrespective of periodic
- * oa buffer having the data or not.
+ * Checks whether the stream has data ready to forward to userspace, by
+ * querying for periodic oa buffer and command stream buffer samples.
  */
 
 static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	enum cs_buf_data_state cs_buf_state;
+	u32 num_samples, last_ts = 0;
 
-	/* Note: the oa_buffer_is_empty() condition is ok to run unlocked as it
-	 * just performs mmio reads of the OA buffer head + tail pointers and
+	/* Note: oa_buffer_num_samples() is ok to run unlocked as it just
+	 * performs mmio reads of the OA buffer head + tail pointers and
 	 * it's assumed we're handling some operation that implies the stream
 	 * can't be destroyed until completion (such as a read()) that ensures
 	 * the device + OA buffer can't disappear
 	 */
+	dev_priv->perf.oa.n_pending_periodic_samples = 0;
+	dev_priv->perf.oa.pending_periodic_ts = 0;
+	num_samples = dev_priv->perf.oa.ops.oa_buffer_num_samples(dev_priv,
+								&last_ts);
 	if (stream->cs_mode)
-		return !command_stream_buf_is_empty(stream);
+		cs_buf_state = command_stream_buf_state(stream);
 	else
-		return !dev_priv->perf.oa.ops.oa_buffer_is_empty(dev_priv);
+		cs_buf_state = CS_BUF_EMPTY;
+
+	/*
+	 * Note: We can safely forward the periodic OA samples when there are
+	 * no pending CS samples, but we can't do so while CS samples are
+	 * pending, since we don't know what the eventual ordering between the
+	 * pending CS samples and the periodic samples will be. Once there are
+	 * no pending CS samples, no future CS sample can have a timestamp
+	 * earlier than the current periodic timestamp.
+	 */
+	switch (cs_buf_state) {
+	case CS_BUF_EMPTY:
+		dev_priv->perf.oa.n_pending_periodic_samples = num_samples;
+		dev_priv->perf.oa.pending_periodic_ts = last_ts;
+		return (num_samples != 0);
+
+	case CS_BUF_HAVE_DATA:
+		return true;
+
+	case CS_BUF_REQ_PENDING:
+	default:
+		return false;
+	}
 }
 
-static bool i915_oa_can_read(struct i915_perf_stream *stream)
+static bool i915_oa_can_read_unlocked(struct i915_perf_stream *stream)
 {
 
 	return stream_have_data__unlocked(stream);
@@ -1124,7 +1212,8 @@ static int i915_oa_read(struct i915_perf_stream *stream,
 	if (stream->cs_mode)
 		return oa_rcs_append_reports(stream, read_state);
 	else
-		return dev_priv->perf.oa.ops.read(stream, read_state, U32_MAX);
+		return dev_priv->perf.oa.ops.read(stream, read_state,
+						U32_MAX, U32_MAX);
 }
 
 static void
@@ -1683,7 +1772,7 @@ static const struct i915_perf_stream_ops i915_oa_stream_ops = {
 	.destroy = i915_oa_stream_destroy,
 	.enable = i915_oa_stream_enable,
 	.disable = i915_oa_stream_disable,
-	.can_read = i915_oa_can_read,
+	.can_read_unlocked = i915_oa_can_read_unlocked,
 	.wait_unlocked = i915_oa_wait_unlocked,
 	.poll_wait = i915_oa_poll_wait,
 	.read = i915_oa_read,
@@ -2132,7 +2221,7 @@ static unsigned int i915_perf_poll_locked(struct i915_perf_stream *stream,
 
 	stream->ops->poll_wait(stream, file, wait);
 
-	if (stream->ops->can_read(stream))
+	if (stream->ops->can_read_unlocked(stream))
 		streams |= POLLIN;
 
 	return streams;
@@ -2607,8 +2696,8 @@ void i915_perf_init(struct drm_device *dev)
 		dev_priv->perf.oa.ops.update_hw_ctx_id_locked =
 			gen7_update_hw_ctx_id_locked;
 		dev_priv->perf.oa.ops.read = gen7_oa_read;
-		dev_priv->perf.oa.ops.oa_buffer_is_empty =
-			gen7_oa_buffer_is_empty_fop_unlocked;
+		dev_priv->perf.oa.ops.oa_buffer_num_samples =
+			gen7_oa_buffer_num_samples_fop_unlocked;
 
 		dev_priv->perf.oa.timestamp_frequency = 12500000;
 
@@ -2624,8 +2713,8 @@ void i915_perf_init(struct drm_device *dev)
 		dev_priv->perf.oa.ops.oa_enable = gen8_oa_enable;
 		dev_priv->perf.oa.ops.oa_disable = gen8_oa_disable;
 		dev_priv->perf.oa.ops.read = gen8_oa_read;
-		dev_priv->perf.oa.ops.oa_buffer_is_empty =
-			gen8_oa_buffer_is_empty_fop_unlocked;
+		dev_priv->perf.oa.ops.oa_buffer_num_samples =
+				gen8_oa_buffer_num_samples_fop_unlocked;
 
 		dev_priv->perf.oa.oa_formats = gen8_plus_oa_formats;
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 05/15] drm/i915: Handle the overflow condition for command stream buf
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (3 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 04/15] drm/i915: flush periodic samples, in case of no pending CS sample requests sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 06/15] drm/i915: Populate ctx ID for periodic OA reports sourab.gupta
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

Add a compile-time option to detect the overflow condition of the
command stream buffer and to avoid overwriting old entries in that case.
Also set a status flag to forward the overflow condition to userspace
when overflow is detected.
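
The offset/overflow decision made by insert_perf_entry() when overwrite is
compiled out can be sketched in user space roughly as follows (a hypothetical
model, not the kernel code; cs_buf_reserve(), first and last are names invented
here — first/last stand for the byte offsets of the oldest and newest entries):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true and sets *offset if a new entry of entry_size bytes fits
 * in the circular buffer of buf_size bytes without overwriting; returns
 * false on overflow (the caller then sets the status flag and -ENOSPC).
 */
static bool cs_buf_reserve(uint32_t first, uint32_t last,
			   uint32_t entry_size, uint32_t buf_size,
			   uint32_t *offset)
{
	if (first <= last) {
		/* Entries laid out linearly: try after the last entry... */
		if (last + 2 * entry_size < buf_size) {
			*offset = last + entry_size;
			return true;
		}
		/* ...else wrap to the start if the oldest entry allows. */
		if (entry_size < first) {
			*offset = 0;
			return true;
		}
	} else if (last + 2 * entry_size < first) {
		/* Already wrapped: fill the gap up to the oldest entry. */
		*offset = last + entry_size;
		return true;
	}
	/* Insufficient space: overflow. */
	return false;
}
```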

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 ++
 drivers/gpu/drm/i915/i915_perf.c | 75 ++++++++++++++++++++++++++++++++--------
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7efdfc2..8cce8bd 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2190,6 +2190,8 @@ struct drm_i915_private {
 			struct drm_i915_gem_object *obj;
 			struct i915_vma *vma;
 			u8 *addr;
+#define I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW (1<<0)
+			u32 status;
 		} command_stream_buf;
 
 		struct list_head node_list;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index b53ccf5..a9cf103 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -81,6 +81,9 @@ static u32 i915_perf_stream_paranoid = true;
 #define GEN8_OAREPORT_REASON_GO_TRANSITION  (1<<23)
 #define GEN9_OAREPORT_REASON_CLK_RATIO      (1<<24)
 
+/* For determining the behavior on overflow of command stream samples */
+#define CMD_STREAM_BUF_OVERFLOW_ALLOWED
+
 /* Data common to periodic and RCS based samples */
 struct oa_sample_data {
 	u32 source;
@@ -182,6 +185,7 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *request)
 	mutex_unlock(&dev_priv->perf.streams_lock);
 }
 
+#ifdef CMD_STREAM_BUF_OVERFLOW_ALLOWED
 /*
  * Release some perf entries to make space for a new entry data. We dereference
  * the associated request before deleting the entry. Also, no need to check for
@@ -208,25 +212,26 @@ static void release_some_perf_entries(struct drm_i915_private *dev_priv,
 			break;
 	}
 }
+#endif
 
 /*
- * Insert the perf entry to the end of the list. This function never fails,
- * since it always manages to insert the entry. If the space is exhausted in
- * the buffer, it will remove the oldest entries in order to make space.
+ * Insert the perf entry at the end of the list. If overwriting old entries
+ * is allowed, the function always manages to insert the entry and returns 0.
+ * If overwriting is not allowed and an overflow condition is detected, the
+ * appropriate status flag is set and the function returns -ENOSPC.
  */
-static void insert_perf_entry(struct drm_i915_private *dev_priv,
+static int insert_perf_entry(struct drm_i915_private *dev_priv,
 				struct i915_perf_cs_data_node *entry)
 {
 	struct i915_perf_cs_data_node *first_entry, *last_entry;
 	int max_offset = dev_priv->perf.command_stream_buf.obj->base.size;
 	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+	int ret = 0;
 
 	spin_lock(&dev_priv->perf.node_list_lock);
 	if (list_empty(&dev_priv->perf.node_list)) {
 		entry->offset = 0;
-		list_add_tail(&entry->link, &dev_priv->perf.node_list);
-		spin_unlock(&dev_priv->perf.node_list_lock);
-		return;
+		goto out;
 	}
 
 	first_entry = list_first_entry(&dev_priv->perf.node_list,
@@ -244,29 +249,49 @@ static void insert_perf_entry(struct drm_i915_private *dev_priv,
 		 */
 		else if (entry_size < first_entry->offset)
 			entry->offset = 0;
-		/* Insufficient space. Overwrite existing old entries */
+		/* Insufficient space */
 		else {
+#ifdef CMD_STREAM_BUF_OVERFLOW_ALLOWED
 			u32 target_size = entry_size - first_entry->offset;
 
 			release_some_perf_entries(dev_priv, target_size);
 			entry->offset = 0;
+#else
+			dev_priv->perf.command_stream_buf.status |=
+				I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
+			ret = -ENOSPC;
+			goto out_unlock;
+#endif
 		}
 	} else {
 		/* Sufficient space available? */
 		if (last_entry->offset + 2*entry_size < first_entry->offset)
 			entry->offset = last_entry->offset + entry_size;
-		/* Insufficient space. Overwrite existing old entries */
+		/* Insufficient space */
 		else {
+#ifdef CMD_STREAM_BUF_OVERFLOW_ALLOWED
 			u32 target_size = entry_size -
 				(first_entry->offset - last_entry->offset -
 				entry_size);
 
 			release_some_perf_entries(dev_priv, target_size);
 			entry->offset = last_entry->offset + entry_size;
+#else
+			dev_priv->perf.command_stream_buf.status |=
+				I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
+			ret = -ENOSPC;
+			goto out_unlock;
+#endif
 		}
 	}
+
+out:
 	list_add_tail(&entry->link, &dev_priv->perf.node_list);
+#ifndef CMD_STREAM_BUF_OVERFLOW_ALLOWED
+out_unlock:
+#endif
 	spin_unlock(&dev_priv->perf.node_list_lock);
+	return ret;
 }
 
 static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
@@ -288,17 +313,17 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 		return;
 	}
 
+	ret = insert_perf_entry(dev_priv, entry);
+	if (ret)
+		goto out_free;
+
 	ret = intel_ring_begin(req, 4);
-	if (ret) {
-		kfree(entry);
-		return;
-	}
+	if (ret)
+		goto out;
 
 	entry->ctx_id = ctx->hw_id;
 	i915_gem_request_assign(&entry->request, req);
 
-	insert_perf_entry(dev_priv, entry);
-
 	addr = dev_priv->perf.command_stream_buf.vma->node.start +
 		entry->offset;
 
@@ -329,6 +354,13 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 		intel_ring_advance(engine);
 	}
 	i915_vma_move_to_active(dev_priv->perf.command_stream_buf.vma, req);
+	return;
+out:
+	spin_lock(&dev_priv->perf.node_list_lock);
+	list_del(&entry->link);
+	spin_unlock(&dev_priv->perf.node_list_lock);
+out_free:
+	kfree(entry);
 }
 
 static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
@@ -1039,7 +1071,20 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 	struct i915_perf_cs_data_node *entry, *next;
 	LIST_HEAD(free_list);
 	int ret = 0;
+#ifndef CMD_STREAM_BUF_OVERFLOW_ALLOWED
+	u32 cs_buf_status = dev_priv->perf.command_stream_buf.status;
+
+	if (unlikely(cs_buf_status &
+			I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW)) {
+		ret = append_oa_status(stream, read_state,
+				       DRM_I915_PERF_RECORD_OA_BUFFER_OVERFLOW);
+		if (ret)
+			return ret;
 
+		dev_priv->perf.command_stream_buf.status &=
+				~I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
+	}
+#endif
 	spin_lock(&dev_priv->perf.node_list_lock);
 	if (list_empty(&dev_priv->perf.node_list)) {
 		spin_unlock(&dev_priv->perf.node_list_lock);
-- 
1.9.1


* [PATCH 06/15] drm/i915: Populate ctx ID for periodic OA reports
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (4 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 05/15] drm/i915: Handle the overflow condition for command stream buf sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 07/15] drm/i915: Add support for having pid output with OA report sourab.gupta
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This adds support for populating the ctx ID in the periodic OA reports
when requested through the corresponding property.

On Gen8, the OA reports themselves contain the ctx ID, which is the one
programmed into the HW while submitting workloads, so it is retrieved
directly from the reports. Gen7 OA reports have no such field, so we
populate it with the last ctx ID seen while forwarding CS reports.
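
The Gen8 extraction can be sketched in user space as below (a hypothetical
example, not the kernel code; per this patch the ctx ID lives in the third
dword of an OA report, i.e. byte offset 12, with only the low 20 bits
significant — the function name is invented here):

```c
#include <stdint.h>
#include <string.h>

/* Extract the 20-bit ctx ID from dword 3 of a Gen8 OA report. */
static uint32_t gen8_report_ctx_id(const uint8_t *report)
{
	uint32_t ctx_id;

	memcpy(&ctx_id, report + 12, sizeof(ctx_id));	/* dword 3 */
	return ctx_id & 0xfffff;			/* low 20 bits */
}
```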

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  3 +++
 drivers/gpu/drm/i915/i915_perf.c | 52 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8cce8bd..9d23ca1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1838,6 +1838,8 @@ struct i915_oa_ops {
 		    u32 ts, u32 max_records);
 	int (*oa_buffer_num_samples)(struct drm_i915_private *dev_priv,
 					u32 *last_ts);
+	u32 (*oa_buffer_get_ctx_id)(struct i915_perf_stream *stream,
+					const u8 *report);
 };
 
 /*
@@ -2194,6 +2196,7 @@ struct drm_i915_private {
 			u32 status;
 		} command_stream_buf;
 
+		u32 last_ctx_id;
 		struct list_head node_list;
 		spinlock_t node_list_lock;
 	} perf;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index a9cf103..2496a4b 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -485,6 +485,46 @@ gen7_oa_buffer_num_samples_fop_unlocked(struct drm_i915_private *dev_priv,
 	return num_samples;
 }
 
+static u32 gen7_oa_buffer_get_ctx_id(struct i915_perf_stream *stream,
+				    const u8 *report)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	if (!stream->cs_mode)
+		WARN_ONCE(1,
+			"CTX ID can't be retrieved if command stream mode not enabled");
+
+	/*
+	 * OA reports generated in Gen7 don't have the ctx ID information.
+	 * Therefore, just rely on the ctx ID information from the last CS
+	 * sample forwarded
+	 */
+	return dev_priv->perf.last_ctx_id;
+}
+
+static u32 gen8_oa_buffer_get_ctx_id(struct i915_perf_stream *stream,
+				    const u8 *report)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+
+	/* The ctx ID present in the OA reports is the intel_context::global_id,
+	 * since this is what is programmed into the ELSP in execlist mode.
+	 * In non-execlist mode, fall back to retrieving the ctx ID from the
+	 * last saved ctx ID from command stream mode.
+	 */
+	if (i915.enable_execlists) {
+		u32 ctx_id = *(u32 *)(report + 12);
+		ctx_id &= 0xfffff;
+		return ctx_id;
+	} else {
+		if (!stream->cs_mode)
+			WARN_ONCE(1,
+				"CTX ID can't be retrieved if command stream mode not enabled");
+
+		return dev_priv->perf.last_ctx_id;
+	}
+}
+
 /**
  * Appends a status record to a userspace read() buffer.
  */
@@ -580,9 +620,9 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 
 		data.source = source;
 	}
-#warning "FIXME: append_oa_buffer_sample: read ctx ID from report and map that to an intel_context::global_id"
 	if (sample_flags & SAMPLE_CTX_ID)
-		data.ctx_id = 0;
+		data.ctx_id = dev_priv->perf.oa.ops.oa_buffer_get_ctx_id(
+						stream, report);
 
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
@@ -1047,8 +1087,10 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
 		data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
 
-	if (sample_flags & SAMPLE_CTX_ID)
+	if (sample_flags & SAMPLE_CTX_ID) {
 		data.ctx_id = node->ctx_id;
+		dev_priv->perf.last_ctx_id = node->ctx_id;
+	}
 
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
@@ -2743,6 +2785,8 @@ void i915_perf_init(struct drm_device *dev)
 		dev_priv->perf.oa.ops.read = gen7_oa_read;
 		dev_priv->perf.oa.ops.oa_buffer_num_samples =
 			gen7_oa_buffer_num_samples_fop_unlocked;
+		dev_priv->perf.oa.ops.oa_buffer_get_ctx_id =
+					gen7_oa_buffer_get_ctx_id;
 
 		dev_priv->perf.oa.timestamp_frequency = 12500000;
 
@@ -2760,6 +2804,8 @@ void i915_perf_init(struct drm_device *dev)
 		dev_priv->perf.oa.ops.read = gen8_oa_read;
 		dev_priv->perf.oa.ops.oa_buffer_num_samples =
 				gen8_oa_buffer_num_samples_fop_unlocked;
+		dev_priv->perf.oa.ops.oa_buffer_get_ctx_id =
+					gen8_oa_buffer_get_ctx_id;
 
 		dev_priv->perf.oa.oa_formats = gen8_plus_oa_formats;
 
-- 
1.9.1


* [PATCH 07/15] drm/i915: Add support for having pid output with OA report
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (5 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 06/15] drm/i915: Populate ctx ID for periodic OA reports sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 08/15] drm/i915: Add support for emitting execbuffer tags through OA counter reports sourab.gupta
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces a new sample flag and adds support for including
the pid with the OA reports generated through RCS commands.

When the stream is opened with the pid sample type, the pid information
is also captured through the command stream samples and forwarded along
with the OA reports.
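
How the per-sample read() size accumulates from the requested sample flags in
this series can be sketched as follows (a hypothetical user-space model;
append_oa_sample() emits the fields in the order source, ctx_id, pid, then the
raw OA report, and the flag values mirror the driver's internal SAMPLE_*
defines — sample_size() is a name invented for this example):

```c
#include <stdint.h>

#define SAMPLE_OA_REPORT	(1 << 0)
#define SAMPLE_OA_SOURCE_INFO	(1 << 1)
#define SAMPLE_CTX_ID		(1 << 2)
#define SAMPLE_PID		(1 << 3)

/* Compute the size of one userspace sample record. */
static uint32_t sample_size(uint32_t flags, uint32_t report_size)
{
	uint32_t size = 0;

	if (flags & SAMPLE_OA_SOURCE_INFO)
		size += 4;		/* u32 source */
	if (flags & SAMPLE_CTX_ID)
		size += 4;		/* u32 ctx_id */
	if (flags & SAMPLE_PID)
		size += 4;		/* u32 pid */
	if (flags & SAMPLE_OA_REPORT)
		size += report_size;	/* raw OA report */
	return size;
}
```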

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 ++
 drivers/gpu/drm/i915/i915_perf.c | 48 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h      |  7 ++++++
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9d23ca1..60e94e6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1851,6 +1851,7 @@ struct i915_perf_cs_data_node {
 	struct drm_i915_gem_request *request;
 	u32 offset;
 	u32 ctx_id;
+	u32 pid;
 };
 
 struct drm_i915_private {
@@ -2197,6 +2198,7 @@ struct drm_i915_private {
 		} command_stream_buf;
 
 		u32 last_ctx_id;
+		u32 last_pid;
 		struct list_head node_list;
 		spinlock_t node_list_lock;
 	} perf;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 2496a4b..bb5356f 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -88,6 +88,7 @@ static u32 i915_perf_stream_paranoid = true;
 struct oa_sample_data {
 	u32 source;
 	u32 ctx_id;
+	u32 pid;
 	const u8 *report;
 };
 
@@ -143,6 +144,7 @@ static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
 #define SAMPLE_OA_REPORT	(1<<0)
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
 #define SAMPLE_CTX_ID		(1<<2)
+#define SAMPLE_PID		(1<<3)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -322,6 +324,7 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 		goto out;
 
 	entry->ctx_id = ctx->hw_id;
+	entry->pid = current->pid;
 	i915_gem_request_assign(&entry->request, req);
 
 	addr = dev_priv->perf.command_stream_buf.vma->node.start +
@@ -582,6 +585,12 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 		buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_PID) {
+		if (copy_to_user(buf, &data->pid, 4))
+			return -EFAULT;
+		buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, data->report, report_size))
 			return -EFAULT;
@@ -624,6 +633,9 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 		data.ctx_id = dev_priv->perf.oa.ops.oa_buffer_get_ctx_id(
 						stream, report);
 
+	if (sample_flags & SAMPLE_PID)
+		data.pid = dev_priv->perf.last_pid;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1092,6 +1104,11 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 		dev_priv->perf.last_ctx_id = node->ctx_id;
 	}
 
+	if (sample_flags & SAMPLE_PID) {
+		data.pid = node->pid;
+		dev_priv->perf.last_pid = node->pid;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1873,6 +1890,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
+	bool require_cs_mode = props->sample_flags & SAMPLE_PID;
 	bool cs_sample_data = props->sample_flags & SAMPLE_OA_REPORT;
 	int ret;
 
@@ -2005,6 +2023,20 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	if (props->sample_flags & SAMPLE_CTX_ID) {
 		stream->sample_flags |= SAMPLE_CTX_ID;
 		stream->sample_size += 4;
+
+		/*
+		 * NB: it's meaningful to request SAMPLE_CTX_ID with just CS
+		 * mode or periodic OA mode sampling but we don't allow
+		 * SAMPLE_CTX_ID without either mode
+		 */
+		if (!require_oa_unit)
+			require_cs_mode = true;
+	}
+
+	if (require_cs_mode && !props->cs_mode) {
+		DRM_ERROR("PID sampling requires a ring to be specified");
+		ret = -EINVAL;
+		goto cs_error;
 	}
 
 	if (props->cs_mode) {
@@ -2015,7 +2047,13 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			goto cs_error;
 		}
 
-		if (!(props->sample_flags & SAMPLE_CTX_ID)) {
+		/*
+		 * The only time we should allow enabling CS mode if it's not
+		 * strictly required, is if SAMPLE_CTX_ID has been requested
+		 * as it's usable with periodic OA or CS sampling.
+		 */
+		if (!require_cs_mode &&
+		    !(props->sample_flags & SAMPLE_CTX_ID)) {
 			DRM_ERROR(
 				"Ring given without requesting any CS specific property");
 			ret = -EINVAL;
@@ -2024,6 +2062,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 
 		stream->cs_mode = true;
 
+		if (props->sample_flags & SAMPLE_PID) {
+			stream->sample_flags |= SAMPLE_PID;
+			stream->sample_size += 4;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv);
 		if (ret)
 			goto cs_error;
@@ -2661,6 +2704,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_CTX_ID:
 			props->sample_flags |= SAMPLE_CTX_ID;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_PID:
+			props->sample_flags |= SAMPLE_PID;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 7faced8..8069790 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1255,6 +1255,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_CTX_ID,
 
+	/**
+	 * The value of this property set to 1 requests inclusion of pid in the
+	 * perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_PID,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1321,6 +1327,7 @@ enum drm_i915_perf_record_type {
 	 *
 	 *     { u32 source_info; } && DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE
 	 *     { u32 ctx_id; } && DRM_I915_PERF_PROP_SAMPLE_CTX_ID
+	 *     { u32 pid; } && DRM_I915_PERF_PROP_SAMPLE_PID
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 08/15] drm/i915: Add support for emitting execbuffer tags through OA counter reports
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (6 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 07/15] drm/i915: Add support for having pid output with OA report sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 09/15] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines sourab.gupta
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch enables userspace to specify per-workload tags via the
execbuffer ioctl. These tags can be added to OA reports, to help
associate reports with the corresponding workloads.

From a userspace perspective, there may be multiple stages within a
single context. The ability to individually associate OA reports with
their corresponding workloads (execbuffers) is needed, which may not be
possible with ctx_id or pid information alone. This patch enables such a
mechanism.

In this patch, the upper 32 bits of the rsvd1 field, which were
previously unused, are now used to pass in the tag.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  7 ++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 ++--
 drivers/gpu/drm/i915/i915_perf.c           | 38 ++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/intel_lrc.c           |  4 ++--
 include/uapi/drm/i915_drm.h                | 12 ++++++++++
 5 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 60e94e6..9da5007 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1708,6 +1708,7 @@ struct i915_execbuffer_params {
 	struct drm_i915_gem_object      *batch_obj;
 	struct intel_context            *ctx;
 	struct drm_i915_gem_request     *request;
+	uint32_t			tag;
 };
 
 /* used in computing the new watermarks state */
@@ -1796,7 +1797,7 @@ struct i915_perf_stream_ops {
 	 * Routine to emit the commands in the command streamer associated
 	 * with the corresponding gpu engine.
 	 */
-	void (*command_stream_hook)(struct drm_i915_gem_request *req);
+	void (*command_stream_hook)(struct drm_i915_gem_request *req, u32 tag);
 };
 
 enum i915_perf_stream_state {
@@ -1852,6 +1853,7 @@ struct i915_perf_cs_data_node {
 	u32 offset;
 	u32 ctx_id;
 	u32 pid;
+	u32 tag;
 };
 
 struct drm_i915_private {
@@ -2199,6 +2201,7 @@ struct drm_i915_private {
 
 		u32 last_ctx_id;
 		u32 last_pid;
+		u32 last_tag;
 		struct list_head node_list;
 		spinlock_t node_list_lock;
 	} perf;
@@ -3563,7 +3566,7 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req);
 void i915_oa_update_reg_state(struct intel_engine_cs *engine,
 			      struct intel_context *ctx,
 			      uint32_t *reg_state);
-void i915_perf_command_stream_hook(struct drm_i915_gem_request *req);
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *req, u32 tag);
 
 /* i915_gem_evict.c */
 int __must_check i915_gem_evict_something(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 8b759af..a6564a0 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1305,7 +1305,7 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	if (exec_len == 0)
 		exec_len = params->batch_obj->base.size;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	ret = engine->dispatch_execbuffer(params->request,
 					exec_start, exec_len,
@@ -1313,7 +1313,7 @@ i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
 	if (ret)
 		return ret;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
@@ -1634,6 +1634,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	params->batch_obj               = batch_obj;
 	params->ctx                     = ctx;
 	params->request                 = req;
+	params->tag			= i915_execbuffer2_get_tag(*args);
 
 	ret = dev_priv->gt.execbuf_submit(params, args, &eb->vmas);
 err_request:
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index bb5356f..902f84f 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -89,6 +89,7 @@ struct oa_sample_data {
 	u32 source;
 	u32 ctx_id;
 	u32 pid;
+	u32 tag;
 	const u8 *report;
 };
 
@@ -145,6 +146,7 @@ static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
 #define SAMPLE_OA_SOURCE_INFO	(1<<1)
 #define SAMPLE_CTX_ID		(1<<2)
 #define SAMPLE_PID		(1<<3)
+#define SAMPLE_TAG		(1<<4)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -169,7 +171,8 @@ struct perf_open_properties {
  * perf mutex lock.
  */
 
-void i915_perf_command_stream_hook(struct drm_i915_gem_request *request)
+void i915_perf_command_stream_hook(struct drm_i915_gem_request *request,
+					u32 tag)
 {
 	struct intel_engine_cs *engine = request->engine;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
@@ -182,7 +185,7 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *request)
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if ((stream->state == I915_PERF_STREAM_ENABLED) &&
 					stream->cs_mode)
-			stream->ops->command_stream_hook(request);
+			stream->ops->command_stream_hook(request, tag);
 	}
 	mutex_unlock(&dev_priv->perf.streams_lock);
 }
@@ -296,7 +299,8 @@ out_unlock:
 	return ret;
 }
 
-static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
+static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
+						u32 tag)
 {
 	struct intel_engine_cs *engine = req->engine;
 	struct intel_ringbuffer *ringbuf = req->ringbuf;
@@ -325,6 +329,7 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req)
 
 	entry->ctx_id = ctx->hw_id;
 	entry->pid = current->pid;
+	entry->tag = tag;
 	i915_gem_request_assign(&entry->request, req);
 
 	addr = dev_priv->perf.command_stream_buf.vma->node.start +
@@ -591,6 +596,12 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 		buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_TAG) {
+		if (copy_to_user(buf, &data->tag, 4))
+			return -EFAULT;
+		buf += 4;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, data->report, report_size))
 			return -EFAULT;
@@ -636,6 +647,9 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_PID)
 		data.pid = dev_priv->perf.last_pid;
 
+	if (sample_flags & SAMPLE_TAG)
+		data.tag = dev_priv->perf.last_tag;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1109,6 +1123,11 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 		dev_priv->perf.last_pid = node->pid;
 	}
 
+	if (sample_flags & SAMPLE_TAG) {
+		data.tag = node->tag;
+		dev_priv->perf.last_tag = node->tag;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1890,7 +1909,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
-	bool require_cs_mode = props->sample_flags & SAMPLE_PID;
+	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
+						      SAMPLE_TAG);
 	bool cs_sample_data = props->sample_flags & SAMPLE_OA_REPORT;
 	int ret;
 
@@ -2034,7 +2054,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	}
 
 	if (require_cs_mode && !props->cs_mode) {
-		DRM_ERROR("PID sampling requires a ring to be specified");
+		DRM_ERROR("PID or TAG sampling require a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -2067,6 +2087,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += 4;
 		}
 
+		if (props->sample_flags & SAMPLE_TAG) {
+			stream->sample_flags |= SAMPLE_TAG;
+			stream->sample_size += 4;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv);
 		if (ret)
 			goto cs_error;
@@ -2707,6 +2732,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_PID:
 			props->sample_flags |= SAMPLE_PID;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_TAG:
+			props->sample_flags |= SAMPLE_TAG;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 2751f3a..cffaf7b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -869,13 +869,13 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	exec_start = params->batch_obj_vm_offset +
 		     args->batch_start_offset;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	ret = engine->emit_bb_start(params->request, exec_start, params->dispatch_flags);
 	if (ret)
 		return ret;
 
-	i915_perf_command_stream_hook(params->request);
+	i915_perf_command_stream_hook(params->request, params->tag);
 
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 8069790..bd0e48d 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -796,6 +796,11 @@ struct drm_i915_gem_execbuffer2 {
 #define i915_execbuffer2_get_context_id(eb2) \
 	((eb2).rsvd1 & I915_EXEC_CONTEXT_ID_MASK)
 
+/* upper 32 bits of rsvd1 field contain tag */
+#define I915_EXEC_TAG_MASK		(0xffffffff00000000UL)
+#define i915_execbuffer2_get_tag(eb2) \
+	((eb2).rsvd1 & I915_EXEC_TAG_MASK)
+
 struct drm_i915_gem_pin {
 	/** Handle of the buffer to be pinned. */
 	__u32 handle;
@@ -1261,6 +1266,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_PID,
 
+	/**
+	 * The value of this property set to 1 requests inclusion of tag in the
+	 * perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_TAG,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1328,6 +1339,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 source_info; } && DRM_I915_PERF_PROP_SAMPLE_OA_SOURCE
 	 *     { u32 ctx_id; } && DRM_I915_PERF_PROP_SAMPLE_CTX_ID
 	 *     { u32 pid; } && DRM_I915_PERF_PROP_SAMPLE_PID
+	 *     { u32 tag; } && DRM_I915_PERF_PROP_SAMPLE_TAG
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1


* [PATCH 09/15] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (7 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 08/15] drm/i915: Add support for emitting execbuffer tags through OA counter reports sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 10/15] drm/i915: Extract raw GPU timestamps from OA reports to forward in perf samples sourab.gupta
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch extends the i915 perf framework to handle perf sample
collection for any given gpu engine. In particular, support is added
for collecting the timestamp sample type, which can be requested for
any engine.
With this, for RCS, timestamps and OA reports can be collected together
and provided to userspace in separate sample fields. For the other
engines, the capability to collect timestamps is added.

Note that only a single stream instance can still be opened at any
given time, though that stream may now be opened on any gpu engine for
the collection of timestamp samples.

So, this patch doesn't yet add support for opening multiple concurrent
streams, though it lays the groundwork for that support to be added
subsequently. Part of this groundwork involves having separate
per-engine command stream buffers for holding the generated samples,
and likewise per-engine instances of a few other data structures that
maintain engine state.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  32 +-
 drivers/gpu/drm/i915/i915_perf.c | 648 ++++++++++++++++++++++++++-------------
 drivers/gpu/drm/i915/i915_reg.h  |   2 +
 include/uapi/drm/i915_drm.h      |   7 +
 4 files changed, 465 insertions(+), 224 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9da5007..2a31b79 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1797,7 +1797,8 @@ struct i915_perf_stream_ops {
 	 * Routine to emit the commands in the command streamer associated
 	 * with the corresponding gpu engine.
 	 */
-	void (*command_stream_hook)(struct drm_i915_gem_request *req, u32 tag);
+	void (*command_stream_hook)(struct i915_perf_stream *stream,
+				struct drm_i915_gem_request *req, u32 tag);
 };
 
 enum i915_perf_stream_state {
@@ -1821,6 +1822,9 @@ struct i915_perf_stream {
 	/* Whether command stream based data collection is enabled */
 	bool cs_mode;
 
+	/* Whether the OA unit is in use */
+	bool using_oa;
+
 	const struct i915_perf_stream_ops *ops;
 };
 
@@ -1850,7 +1854,16 @@ struct i915_oa_ops {
 struct i915_perf_cs_data_node {
 	struct list_head link;
 	struct drm_i915_gem_request *request;
-	u32 offset;
+
+	/* Offsets into the GEM obj holding the data */
+	u32 start_offset;
+	u32 oa_offset;
+	u32 ts_offset;
+
+	/* buffer size corresponding to this entry */
+	u32 size;
+
+	/* Other metadata */
 	u32 ctx_id;
 	u32 pid;
 	u32 tag;
@@ -2147,14 +2160,13 @@ struct drm_i915_private {
 
 		spinlock_t hook_lock;
 
-		struct {
-			struct i915_perf_stream *exclusive_stream;
+		struct hrtimer poll_check_timer;
+		struct i915_perf_stream *exclusive_stream;
+		wait_queue_head_t poll_wq[I915_NUM_ENGINES];
 
+		struct {
 			u32 specific_ctx_id;
 
-			struct hrtimer poll_check_timer;
-			wait_queue_head_t poll_wq;
-
 			bool periodic;
 			int period_exponent;
 			int timestamp_frequency;
@@ -2197,13 +2209,13 @@ struct drm_i915_private {
 			u8 *addr;
 #define I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW (1<<0)
 			u32 status;
-		} command_stream_buf;
+		} command_stream_buf[I915_NUM_ENGINES];
 
 		u32 last_ctx_id;
 		u32 last_pid;
 		u32 last_tag;
-		struct list_head node_list;
-		spinlock_t node_list_lock;
+		struct list_head node_list[I915_NUM_ENGINES];
+		spinlock_t node_list_lock[I915_NUM_ENGINES];
 	} perf;
 
 	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 902f84f..4a6fc5e 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -84,12 +84,17 @@ static u32 i915_perf_stream_paranoid = true;
 /* For determining the behavior on overflow of command stream samples */
 #define CMD_STREAM_BUF_OVERFLOW_ALLOWED
 
-/* Data common to periodic and RCS based samples */
-struct oa_sample_data {
+#define OA_ADDR_ALIGN 64
+#define TS_ADDR_ALIGN 8
+#define I915_PERF_TS_SAMPLE_SIZE 8
+
+/* Data common to all samples (periodic OA / CS based OA / Timestamps) */
+struct sample_data {
 	u32 source;
 	u32 ctx_id;
 	u32 pid;
 	u32 tag;
+	u64 ts;
 	const u8 *report;
 };
 
@@ -147,6 +152,7 @@ static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
 #define SAMPLE_CTX_ID		(1<<2)
 #define SAMPLE_PID		(1<<3)
 #define SAMPLE_TAG		(1<<4)
+#define SAMPLE_TS		(1<<5)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -184,8 +190,9 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *request,
 	mutex_lock(&dev_priv->perf.streams_lock);
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if ((stream->state == I915_PERF_STREAM_ENABLED) &&
-					stream->cs_mode)
-			stream->ops->command_stream_hook(request, tag);
+				stream->cs_mode &&
+				(stream->engine == engine->id))
+			stream->ops->command_stream_hook(stream, request, tag);
 	}
 	mutex_unlock(&dev_priv->perf.streams_lock);
 }
@@ -199,16 +206,15 @@ void i915_perf_command_stream_hook(struct drm_i915_gem_request *request,
  * eventually, when the request associated with new entry completes.
  */
 static void release_some_perf_entries(struct drm_i915_private *dev_priv,
-					u32 target_size)
+				enum intel_engine_id id, u32 target_size)
 {
 	struct i915_perf_cs_data_node *entry, *next;
-	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
 	u32 size = 0;
 
 	list_for_each_entry_safe
-		(entry, next, &dev_priv->perf.node_list, link) {
+		(entry, next, &dev_priv->perf.node_list[id], link) {
 
-		size += entry_size;
+		size += entry->size;
 		i915_gem_request_unreference(entry->request);
 		list_del(&entry->link);
 		kfree(entry);
@@ -226,43 +232,61 @@ static void release_some_perf_entries(struct drm_i915_private *dev_priv,
  * appropriate status flag is set, and function returns -ENOSPC.
  */
 static int insert_perf_entry(struct drm_i915_private *dev_priv,
+				struct i915_perf_stream *stream,
 				struct i915_perf_cs_data_node *entry)
 {
 	struct i915_perf_cs_data_node *first_entry, *last_entry;
-	int max_offset = dev_priv->perf.command_stream_buf.obj->base.size;
-	u32 entry_size = dev_priv->perf.oa.oa_buffer.format_size;
+	u32 sample_flags = stream->sample_flags;
+	enum intel_engine_id id = stream->engine;
+	int max_offset = dev_priv->perf.command_stream_buf[id].obj->base.size;
+	u32 offset, entry_size = 0;
+	bool sample_ts = false;
 	int ret = 0;
 
-	spin_lock(&dev_priv->perf.node_list_lock);
-	if (list_empty(&dev_priv->perf.node_list)) {
-		entry->offset = 0;
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		entry_size += dev_priv->perf.oa.oa_buffer.format_size;
+	else if (sample_flags & SAMPLE_TS) {
+		/*
+		 * XXX: Since TS data can anyways be derived from OA report, so
+		 * no need to capture it for RCS engine, if capture oa data is
+		 * called already.
+		 */
+		entry_size += I915_PERF_TS_SAMPLE_SIZE;
+		sample_ts = true;
+	}
+
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	if (list_empty(&dev_priv->perf.node_list[id])) {
+		offset = 0;
 		goto out;
 	}
 
-	first_entry = list_first_entry(&dev_priv->perf.node_list,
+	first_entry = list_first_entry(&dev_priv->perf.node_list[id],
 				       typeof(*first_entry), link);
-	last_entry = list_last_entry(&dev_priv->perf.node_list,
+	last_entry = list_last_entry(&dev_priv->perf.node_list[id],
 				     typeof(*last_entry), link);
 
-	if (last_entry->offset >= first_entry->offset) {
+	if (last_entry->start_offset >= first_entry->start_offset) {
 		/* Sufficient space available at the end of buffer? */
-		if (last_entry->offset + 2*entry_size < max_offset)
-			entry->offset = last_entry->offset + entry_size;
+		if (last_entry->start_offset + last_entry->size + entry_size
+							< max_offset)
+			offset = last_entry->start_offset + last_entry->size;
 		/*
 		 * Wraparound condition. Is sufficient space available at
 		 * beginning of buffer?
 		 */
-		else if (entry_size < first_entry->offset)
-			entry->offset = 0;
+		else if (entry_size < first_entry->start_offset)
+			offset = 0;
 		/* Insufficient space */
 		else {
 #ifdef CMD_STREAM_BUF_OVERFLOW_ALLOWED
-			u32 target_size = entry_size - first_entry->offset;
+			u32 target_size = entry_size -
+						first_entry->start_offset;
 
-			release_some_perf_entries(dev_priv, target_size);
-			entry->offset = 0;
+			release_some_perf_entries(dev_priv, id, target_size);
+			offset = 0;
 #else
-			dev_priv->perf.command_stream_buf.status |=
+			dev_priv->perf.command_stream_buf[id].status |=
 				I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
 			ret = -ENOSPC;
 			goto out_unlock;
@@ -270,19 +294,21 @@ static int insert_perf_entry(struct drm_i915_private *dev_priv,
 		}
 	} else {
 		/* Sufficient space available? */
-		if (last_entry->offset + 2*entry_size < first_entry->offset)
-			entry->offset = last_entry->offset + entry_size;
+		if (last_entry->start_offset + last_entry->size + entry_size
+						< first_entry->start_offset)
+			offset = last_entry->start_offset + last_entry->size;
 		/* Insufficient space */
 		else {
 #ifdef CMD_STREAM_BUF_OVERFLOW_ALLOWED
 			u32 target_size = entry_size -
-				(first_entry->offset - last_entry->offset -
-				entry_size);
+				(first_entry->start_offset -
+					last_entry->start_offset -
+					last_entry->size);
 
-			release_some_perf_entries(dev_priv, target_size);
-			entry->offset = last_entry->offset + entry_size;
+			release_some_perf_entries(dev_priv, id, target_size);
+			offset = last_entry->start_offset + last_entry->size;
 #else
-			dev_priv->perf.command_stream_buf.status |=
+			dev_priv->perf.command_stream_buf[id].status |=
 				I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
 			ret = -ENOSPC;
 			goto out_unlock;
@@ -291,49 +317,47 @@ static int insert_perf_entry(struct drm_i915_private *dev_priv,
 	}
 
 out:
-	list_add_tail(&entry->link, &dev_priv->perf.node_list);
+	entry->start_offset = offset;
+	entry->size = entry_size;
+	if (stream->sample_flags & SAMPLE_OA_REPORT) {
+		entry->oa_offset = offset;
+		/* Ensure 64 byte alignment of oa_offset */
+		entry->oa_offset = ALIGN(entry->oa_offset, OA_ADDR_ALIGN);
+		offset = entry->oa_offset +
+				dev_priv->perf.oa.oa_buffer.format_size;
+	}
+	if (sample_ts) {
+		entry->ts_offset = offset;
+		/* Ensure 8 byte alignment of ts_offset */
+		entry->ts_offset = ALIGN(entry->ts_offset, TS_ADDR_ALIGN);
+		offset = entry->ts_offset + I915_PERF_TS_SAMPLE_SIZE;
+	}
+
+	list_add_tail(&entry->link, &dev_priv->perf.node_list[id]);
 #ifndef CMD_STREAM_BUF_OVERFLOW_ALLOWED
 out_unlock:
 #endif
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 	return ret;
 }
 
-static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
-						u32 tag)
+static int i915_ring_stream_capture_oa(struct drm_i915_gem_request *req,
+				u32 offset)
 {
 	struct intel_engine_cs *engine = req->engine;
 	struct intel_ringbuffer *ringbuf = req->ringbuf;
-	struct intel_context *ctx = req->ctx;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
-	struct i915_perf_cs_data_node *entry;
 	u32 addr = 0;
 	int ret;
 
 	/* OA counters are only supported on the render engine */
 	BUG_ON(engine->id != RCS);
 
-	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
-	if (entry == NULL) {
-		DRM_ERROR("alloc failed\n");
-		return;
-	}
-
-	ret = insert_perf_entry(dev_priv, entry);
-	if (ret)
-		goto out_free;
-
 	ret = intel_ring_begin(req, 4);
 	if (ret)
-		goto out;
-
-	entry->ctx_id = ctx->hw_id;
-	entry->pid = current->pid;
-	entry->tag = tag;
-	i915_gem_request_assign(&entry->request, req);
+		return ret;
 
-	addr = dev_priv->perf.command_stream_buf.vma->node.start +
-		entry->offset;
+	addr = dev_priv->perf.command_stream_buf[RCS].vma->node.start + offset;
 
 	/* addr should be 64 byte aligned */
 	BUG_ON(addr & 0x3f);
@@ -361,17 +385,153 @@ static void i915_perf_command_stream_hook_oa(struct drm_i915_gem_request *req,
 		}
 		intel_ring_advance(engine);
 	}
-	i915_vma_move_to_active(dev_priv->perf.command_stream_buf.vma, req);
+	return 0;
+}
+
+static int i915_ring_stream_capture_ts(struct drm_i915_gem_request *req,
+						u32 offset)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	u32 addr = 0;
+	int ret;
+
+	ret = intel_ring_begin(req, 6);
+	if (ret)
+		return ret;
+
+	addr = dev_priv->perf.command_stream_buf[engine->id].vma->node.start +
+		offset;
+
+	if (i915.enable_execlists) {
+		if (engine->id == RCS) {
+			intel_logical_ring_emit(ringbuf,
+						GFX_OP_PIPE_CONTROL(6));
+			intel_logical_ring_emit(ringbuf,
+						PIPE_CONTROL_GLOBAL_GTT_IVB |
+						PIPE_CONTROL_TIMESTAMP_WRITE);
+			intel_logical_ring_emit(ringbuf, addr |
+						PIPE_CONTROL_GLOBAL_GTT);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+		} else {
+			uint32_t cmd;
+
+			cmd = MI_FLUSH_DW + 2; /* Gen8+ */
+
+			cmd |= MI_FLUSH_DW_OP_STAMP;
+
+			intel_logical_ring_emit(ringbuf, cmd);
+			intel_logical_ring_emit(ringbuf, addr |
+						MI_FLUSH_DW_USE_GTT);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, 0);
+			intel_logical_ring_emit(ringbuf, MI_NOOP);
+		}
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		if (engine->id == RCS) {
+			if (INTEL_INFO(engine->dev)->gen >= 8)
+				intel_ring_emit(engine, GFX_OP_PIPE_CONTROL(6));
+			else
+				intel_ring_emit(engine, GFX_OP_PIPE_CONTROL(5));
+			intel_ring_emit(engine,
+					PIPE_CONTROL_GLOBAL_GTT_IVB |
+					PIPE_CONTROL_TIMESTAMP_WRITE);
+			intel_ring_emit(engine, addr | PIPE_CONTROL_GLOBAL_GTT);
+			intel_ring_emit(engine, 0);
+			if (INTEL_INFO(engine->dev)->gen >= 8) {
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, 0);
+			} else {
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, MI_NOOP);
+			}
+		} else {
+			uint32_t cmd;
+
+			cmd = MI_FLUSH_DW + 1;
+			if (INTEL_INFO(engine->dev)->gen >= 8)
+				cmd += 1;
+
+			cmd |= MI_FLUSH_DW_OP_STAMP;
+
+			intel_ring_emit(engine, cmd);
+			intel_ring_emit(engine, addr | MI_FLUSH_DW_USE_GTT);
+			if (INTEL_INFO(engine->dev)->gen >= 8) {
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, 0);
+			} else {
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, 0);
+				intel_ring_emit(engine, MI_NOOP);
+			}
+			intel_ring_emit(engine, MI_NOOP);
+		}
+		intel_ring_advance(engine);
+	}
+	return 0;
+}
+
+static void i915_ring_stream_cs_hook(struct i915_perf_stream *stream,
+				struct drm_i915_gem_request *req, u32 tag)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct intel_context *ctx = req->ctx;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	enum intel_engine_id id = stream->engine;
+	u32 sample_flags = stream->sample_flags;
+	struct i915_perf_cs_data_node *entry;
+	int ret = 0;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (entry == NULL) {
+		DRM_ERROR("alloc failed\n");
+		return;
+	}
+
+	ret = insert_perf_entry(dev_priv, stream, entry);
+	if (ret)
+		goto err_free;
+
+	entry->ctx_id = ctx->hw_id;
+	entry->pid = current->pid;
+	entry->tag = tag;
+	i915_gem_request_assign(&entry->request, req);
+
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		ret = i915_ring_stream_capture_oa(req, entry->oa_offset);
+		if (ret)
+			goto err_unref;
+	} else if (sample_flags & SAMPLE_TS) {
+		/*
+		 * XXX: Since TS data can anyways be derived from OA report, so
+		 * no need to capture it for RCS engine, if capture oa data is
+		 * called already.
+		 */
+		ret = i915_ring_stream_capture_ts(req, entry->ts_offset);
+		if (ret)
+			goto err_unref;
+	}
+
+	i915_vma_move_to_active(dev_priv->perf.command_stream_buf[id].vma, req);
 	return;
-out:
-	spin_lock(&dev_priv->perf.node_list_lock);
+
+err_unref:
+	i915_gem_request_unreference(entry->request);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
 	list_del(&entry->link);
-	spin_unlock(&dev_priv->perf.node_list_lock);
-out_free:
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
+err_free:
 	kfree(entry);
 }
 
-static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
+static int i915_ring_stream_wait_gpu(struct drm_i915_private *dev_priv,
+				enum intel_engine_id id)
 {
 	struct i915_perf_cs_data_node *last_entry = NULL;
 	struct drm_i915_gem_request *req = NULL;
@@ -382,14 +542,14 @@ static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
 	 * implicitly wait for the prior submitted requests. The refcount
 	 * of the requests is not decremented here.
 	 */
-	spin_lock(&dev_priv->perf.node_list_lock);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
 
-	if (!list_empty(&dev_priv->perf.node_list)) {
-		last_entry = list_last_entry(&dev_priv->perf.node_list,
+	if (!list_empty(&dev_priv->perf.node_list[id])) {
+		last_entry = list_last_entry(&dev_priv->perf.node_list[id],
 			struct i915_perf_cs_data_node, link);
 		req = last_entry->request;
 	}
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 	if (!req)
 		return 0;
@@ -402,17 +562,18 @@ static int i915_oa_rcs_wait_gpu(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-static void i915_oa_rcs_free_requests(struct drm_i915_private *dev_priv)
+static void i915_ring_stream_free_requests(struct drm_i915_private *dev_priv,
+				enum intel_engine_id id)
 {
 	struct i915_perf_cs_data_node *entry, *next;
 
 	list_for_each_entry_safe
-		(entry, next, &dev_priv->perf.node_list, link) {
+		(entry, next, &dev_priv->perf.node_list[id], link) {
 		i915_gem_request_unreference(entry->request);
 
-		spin_lock(&dev_priv->perf.node_list_lock);
+		spin_lock(&dev_priv->perf.node_list_lock[id]);
 		list_del(&entry->link);
-		spin_unlock(&dev_priv->perf.node_list_lock);
+		spin_unlock(&dev_priv->perf.node_list_lock[id]);
 		kfree(entry);
 	}
 }
@@ -555,11 +716,11 @@ static int append_oa_status(struct i915_perf_stream *stream,
 }
 
 /**
- * Copies single OA report into userspace read() buffer.
+ * Copies single sample into userspace read() buffer.
  */
-static int append_oa_sample(struct i915_perf_stream *stream,
+static int append_sample(struct i915_perf_stream *stream,
 			    struct i915_perf_read_state *read_state,
-			    struct oa_sample_data *data)
+			    struct sample_data *data)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
@@ -602,6 +763,12 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 		buf += 4;
 	}
 
+	if (sample_flags & SAMPLE_TS) {
+		if (copy_to_user(buf, &data->ts, I915_PERF_TS_SAMPLE_SIZE))
+			return -EFAULT;
+		buf += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, data->report, report_size))
 			return -EFAULT;
@@ -620,7 +787,7 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 sample_flags = stream->sample_flags;
-	struct oa_sample_data data = { 0 };
+	struct sample_data data = { 0 };
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
@@ -650,10 +817,15 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_TAG)
 		data.tag = dev_priv->perf.last_tag;
 
+	/* Derive timestamp from OA report, after scaling with the ts base */
+#warning "FIXME: append_oa_buffer_sample: derive the timestamp from OA report"
+	if (sample_flags & SAMPLE_TS)
+		data.ts = 0;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
-	return append_oa_sample(stream, read_state, &data);
+	return append_sample(stream, read_state, &data);
 }
 
 /**
@@ -755,7 +927,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 		 * an invalid ID. It could be good to annotate these
 		 * reports with a _CTX_SWITCH_AWAY reason later.
 		 */
-		if (!dev_priv->perf.oa.exclusive_stream->ctx ||
+		if (!dev_priv->perf.exclusive_stream->ctx ||
 		    dev_priv->perf.oa.specific_ctx_id == ctx_id ||
 		    dev_priv->perf.oa.oa_buffer.last_ctx_id == ctx_id) {
 
@@ -766,7 +938,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 			 * the switch-away reports with an invalid
 			 * context id to be recognisable by userspace.
 			 */
-			if (dev_priv->perf.oa.exclusive_stream->ctx &&
+			if (dev_priv->perf.exclusive_stream->ctx &&
 			    dev_priv->perf.oa.specific_ctx_id != ctx_id)
 				report32[2] = 0x1fffff;
 
@@ -1084,31 +1256,39 @@ static int gen7_oa_read(struct i915_perf_stream *stream,
 }
 
 /**
- * Copies a command stream OA report into userspace read() buffer, while also
- * forwarding the periodic OA reports with timestamp lower than CS report.
+ * Copy one command stream report into userspace read() buffer.
+ * For OA reports, also forward the periodic OA reports with timestamp
+ * lower than current CS OA sample.
  *
  * NB: some data may be successfully copied to the userspace buffer
  * even if an error is returned, and this is reflected in the
  * updated @read_state.
  */
-static int append_oa_rcs_sample(struct i915_perf_stream *stream,
+static int append_one_cs_sample(struct i915_perf_stream *stream,
 				 struct i915_perf_read_state *read_state,
 				 struct i915_perf_cs_data_node *node)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
-	struct oa_sample_data data = { 0 };
-	const u8 *report = dev_priv->perf.command_stream_buf.addr +
-				node->offset;
+	enum intel_engine_id id = stream->engine;
+	struct sample_data data = { 0 };
 	u32 sample_flags = stream->sample_flags;
-	u32 report_ts;
-	int ret;
+	int ret = 0;
 
-	/* First, append the periodic OA samples having lower timestamps */
-	report_ts = *(u32 *)(report + 4);
-	ret = dev_priv->perf.oa.ops.read(stream, read_state,
-					report_ts, U32_MAX);
-	if (ret)
-		return ret;
+	if (sample_flags & SAMPLE_OA_REPORT) {
+		const u8 *report = dev_priv->perf.command_stream_buf[id].addr +
+				   node->oa_offset;
+		u32 sample_ts = *(u32 *)(report + 4);
+
+		data.report = report;
+
+		/* First, append the periodic OA samples having lower
+		 * timestamp values
+		 */
+		ret = dev_priv->perf.oa.ops.read(stream, read_state, sample_ts,
+						U32_MAX);
+		if (ret)
+			return ret;
+	}
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
 		data.source = I915_PERF_OA_EVENT_SOURCE_RCS;
@@ -1128,25 +1308,37 @@ static int append_oa_rcs_sample(struct i915_perf_stream *stream,
 		dev_priv->perf.last_tag = node->tag;
 	}
 
-	if (sample_flags & SAMPLE_OA_REPORT)
-		data.report = report;
+	if (sample_flags & SAMPLE_TS) {
+		/* For RCS, if OA samples are also being collected, derive the
+		 * timestamp from OA report, after scaling with the TS base.
+		 * Else, forward the timestamp collected via command stream.
+		 */
+#warning "FIXME: append_one_cs_sample: derive the timestamp from OA report"
+		if (sample_flags & SAMPLE_OA_REPORT)
+			data.ts = 0;
+		else
+			data.ts = *(u64 *)
+				(dev_priv->perf.command_stream_buf[id].addr +
+					node->ts_offset);
+	}
 
-	return append_oa_sample(stream, read_state, &data);
+	return append_sample(stream, read_state, &data);
 }
 
 /**
- * Copies all OA reports into userspace read() buffer. This includes command
- * stream as well as periodic OA reports.
+ * Copies all samples into userspace read() buffer. This includes command
+ * stream samples as well as periodic OA reports (if enabled).
  *
  * NB: some data may be successfully copied to the userspace buffer
  * even if an error is returned, and this is reflected in the
  * updated @read_state.
  */
-static int oa_rcs_append_reports(struct i915_perf_stream *stream,
+static int append_command_stream_samples(struct i915_perf_stream *stream,
 				  struct i915_perf_read_state *read_state)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	struct i915_perf_cs_data_node *entry, *next;
+	enum intel_engine_id id = stream->engine;
 	LIST_HEAD(free_list);
 	int ret = 0;
 #ifndef CMD_STREAM_BUF_OVERFLOW_ALLOWED
@@ -1163,24 +1355,24 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 				~I915_PERF_CMD_STREAM_BUF_STATUS_OVERFLOW;
 	}
 #endif
-	spin_lock(&dev_priv->perf.node_list_lock);
-	if (list_empty(&dev_priv->perf.node_list)) {
-		spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	if (list_empty(&dev_priv->perf.node_list[id])) {
+		spin_unlock(&dev_priv->perf.node_list_lock[id]);
 		goto pending_periodic;
 	}
 	list_for_each_entry_safe(entry, next,
-				 &dev_priv->perf.node_list, link) {
+				 &dev_priv->perf.node_list[id], link) {
 		if (!i915_gem_request_completed(entry->request, true))
 			break;
 		list_move_tail(&entry->link, &free_list);
 	}
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 	if (list_empty(&free_list))
 		goto pending_periodic;
 
 	list_for_each_entry_safe(entry, next, &free_list, link) {
-		ret = append_oa_rcs_sample(stream, read_state, entry);
+		ret = append_one_cs_sample(stream, read_state, entry);
 		if (ret)
 			break;
 
@@ -1190,14 +1382,15 @@ static int oa_rcs_append_reports(struct i915_perf_stream *stream,
 	}
 
 	/* Don't discard remaining entries, keep them for next read */
-	spin_lock(&dev_priv->perf.node_list_lock);
-	list_splice(&free_list, &dev_priv->perf.node_list);
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	list_splice(&free_list, &dev_priv->perf.node_list[id]);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 	return ret;
 
 pending_periodic:
-	if (!dev_priv->perf.oa.n_pending_periodic_samples)
+	if (!((stream->sample_flags & SAMPLE_OA_REPORT) &&
+			dev_priv->perf.oa.n_pending_periodic_samples))
 		return 0;
 
 	ret = dev_priv->perf.oa.ops.read(stream, read_state,
@@ -1226,15 +1419,16 @@ static enum cs_buf_data_state command_stream_buf_state(
 				struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	enum intel_engine_id id = stream->engine;
 	struct i915_perf_cs_data_node *entry = NULL;
 	struct drm_i915_gem_request *request = NULL;
 
-	spin_lock(&dev_priv->perf.node_list_lock);
-	entry = list_first_entry_or_null(&dev_priv->perf.node_list,
+	spin_lock(&dev_priv->perf.node_list_lock[id]);
+	entry = list_first_entry_or_null(&dev_priv->perf.node_list[id],
 			struct i915_perf_cs_data_node, link);
 	if (entry)
 		request = entry->request;
-	spin_unlock(&dev_priv->perf.node_list_lock);
+	spin_unlock(&dev_priv->perf.node_list_lock[id]);
 
 	if (!entry)
 		return CS_BUF_EMPTY;
@@ -1252,23 +1446,23 @@ static enum cs_buf_data_state command_stream_buf_state(
 static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
-	enum cs_buf_data_state cs_buf_state;
-	u32 num_samples, last_ts = 0;
-
-	/* Note: oa_buffer_num_samples() is ok to run unlocked as it just
-	 * performs mmio reads of the OA buffer head + tail pointers and
-	 * it's assumed we're handling some operation that implies the stream
-	 * can't be destroyed until completion (such as a read()) that ensures
-	 * the device + OA buffer can't disappear
-	 */
-	dev_priv->perf.oa.n_pending_periodic_samples = 0;
-	dev_priv->perf.oa.pending_periodic_ts = 0;
-	num_samples = dev_priv->perf.oa.ops.oa_buffer_num_samples(dev_priv,
-								&last_ts);
-	if (stream->cs_mode)
+	enum cs_buf_data_state cs_buf_state = CS_BUF_EMPTY;
+	u32 num_samples = 0, last_ts = 0;
+
+	if (stream->sample_flags & SAMPLE_OA_REPORT) {
+		/* Note: oa_buffer_num_samples() is ok to run unlocked as it
+		 * just performs mmio reads of the OA buffer head + tail
+		 * pointers and it's assumed we're handling some operation that
+		 * implies the stream can't be destroyed until completion (such
+		 * as a read()) that ensures the device + OA buffer can't
+		 * disappear
+		 */
+		dev_priv->perf.oa.n_pending_periodic_samples = 0;
+		dev_priv->perf.oa.pending_periodic_ts = 0;
+		num_samples = dev_priv->perf.oa.ops.oa_buffer_num_samples(
+							dev_priv, &last_ts);
+	} else if (stream->cs_mode)
 		cs_buf_state = command_stream_buf_state(stream);
-	else
-		cs_buf_state = CS_BUF_EMPTY;
 
 	/*
 	 * Note: We can safely forward the periodic OA samples in the case we
@@ -1280,9 +1474,13 @@ static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 	 */
 	switch (cs_buf_state) {
 	case CS_BUF_EMPTY:
-		dev_priv->perf.oa.n_pending_periodic_samples = num_samples;
-		dev_priv->perf.oa.pending_periodic_ts = last_ts;
-		return (num_samples != 0);
+		if (stream->sample_flags & SAMPLE_OA_REPORT) {
+			dev_priv->perf.oa.n_pending_periodic_samples =
+								num_samples;
+			dev_priv->perf.oa.pending_periodic_ts = last_ts;
+			return (num_samples != 0);
+		} else
+			return false;
 
 	case CS_BUF_HAVE_DATA:
 		return true;
@@ -1293,15 +1491,16 @@ static bool stream_have_data__unlocked(struct i915_perf_stream *stream)
 	}
 }
 
-static bool i915_oa_can_read_unlocked(struct i915_perf_stream *stream)
+static bool i915_ring_stream_can_read_unlocked(struct i915_perf_stream *stream)
 {
 
 	return stream_have_data__unlocked(stream);
 }
 
-static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
+static int i915_ring_stream_wait_unlocked(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
+	enum intel_engine_id id = stream->engine;
 	int ret;
 
 	/* We would wait indefinitely if periodic sampling is not enabled */
@@ -1309,49 +1508,52 @@ static int i915_oa_wait_unlocked(struct i915_perf_stream *stream)
 		return -EIO;
 
 	if (stream->cs_mode) {
-		ret = i915_oa_rcs_wait_gpu(dev_priv);
+		ret = i915_ring_stream_wait_gpu(dev_priv, id);
 		if (ret)
 			return ret;
 	}
 
-	return wait_event_interruptible(dev_priv->perf.oa.poll_wq,
+	return wait_event_interruptible(dev_priv->perf.poll_wq[id],
 					stream_have_data__unlocked(stream));
 }
 
-static void i915_oa_poll_wait(struct i915_perf_stream *stream,
+static void i915_ring_stream_poll_wait(struct i915_perf_stream *stream,
 			      struct file *file,
 			      poll_table *wait)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	poll_wait(file, &dev_priv->perf.oa.poll_wq, wait);
+	poll_wait(file, &dev_priv->perf.poll_wq[stream->engine], wait);
 }
 
-static int i915_oa_read(struct i915_perf_stream *stream,
+static int i915_ring_stream_read(struct i915_perf_stream *stream,
 			struct i915_perf_read_state *read_state)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
 	if (stream->cs_mode)
-		return oa_rcs_append_reports(stream, read_state);
-	else
+		return append_command_stream_samples(stream, read_state);
+	else if (stream->sample_flags & SAMPLE_OA_REPORT)
 		return dev_priv->perf.oa.ops.read(stream, read_state,
 						U32_MAX, U32_MAX);
+	else
+		return -EINVAL;
 }
 
 static void
-free_command_stream_buf(struct drm_i915_private *dev_priv)
+free_command_stream_buf(struct drm_i915_private *dev_priv,
+				enum intel_engine_id id)
 {
 	mutex_lock(&dev_priv->dev->struct_mutex);
 
-	i915_gem_object_unpin_map(dev_priv->perf.command_stream_buf.obj);
-	i915_gem_object_ggtt_unpin(dev_priv->perf.command_stream_buf.obj);
+	i915_gem_object_unpin_map(dev_priv->perf.command_stream_buf[id].obj);
+	i915_gem_object_ggtt_unpin(dev_priv->perf.command_stream_buf[id].obj);
 	drm_gem_object_unreference(
-			&dev_priv->perf.command_stream_buf.obj->base);
+			&dev_priv->perf.command_stream_buf[id].obj->base);
 
-	dev_priv->perf.command_stream_buf.obj = NULL;
-	dev_priv->perf.command_stream_buf.vma = NULL;
-	dev_priv->perf.command_stream_buf.addr = NULL;
+	dev_priv->perf.command_stream_buf[id].obj = NULL;
+	dev_priv->perf.command_stream_buf[id].vma = NULL;
+	dev_priv->perf.command_stream_buf[id].addr = NULL;
 
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
@@ -1372,16 +1574,13 @@ free_oa_buffer(struct drm_i915_private *i915)
 	mutex_unlock(&i915->dev->struct_mutex);
 }
 
-static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
+static void i915_ring_stream_destroy(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	BUG_ON(stream != dev_priv->perf.oa.exclusive_stream);
-
-	if (stream->cs_mode)
-		free_command_stream_buf(dev_priv);
+	BUG_ON(stream != dev_priv->perf.exclusive_stream);
 
-	if (dev_priv->perf.oa.oa_buffer.obj) {
+	if (stream->using_oa) {
 		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
 
 		free_oa_buffer(dev_priv);
@@ -1390,7 +1589,10 @@ static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 		intel_runtime_pm_put(dev_priv);
 	}
 
-	dev_priv->perf.oa.exclusive_stream = NULL;
+	if (stream->cs_mode)
+		free_command_stream_buf(dev_priv, stream->engine);
+
+	dev_priv->perf.exclusive_stream = NULL;
 }
 
 static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
@@ -1521,29 +1723,30 @@ static int alloc_oa_buffer(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
-static int alloc_command_stream_buf(struct drm_i915_private *dev_priv)
+static int alloc_command_stream_buf(struct drm_i915_private *dev_priv,
+					enum intel_engine_id id)
 {
 	struct drm_i915_gem_object *bo;
 	u8 *obj_addr;
 	int ret;
 
-	BUG_ON(dev_priv->perf.command_stream_buf.obj);
+	BUG_ON(dev_priv->perf.command_stream_buf[id].obj);
 
 	ret = alloc_obj(dev_priv, &bo, &obj_addr);
 	if (ret)
 		return ret;
 
-	dev_priv->perf.command_stream_buf.obj = bo;
-	dev_priv->perf.command_stream_buf.addr = obj_addr;
-	dev_priv->perf.command_stream_buf.vma = i915_gem_obj_to_ggtt(bo);
-	if (WARN_ON(!list_empty(&dev_priv->perf.node_list)))
-		INIT_LIST_HEAD(&dev_priv->perf.node_list);
+	dev_priv->perf.command_stream_buf[id].obj = bo;
+	dev_priv->perf.command_stream_buf[id].addr = obj_addr;
+	dev_priv->perf.command_stream_buf[id].vma = i915_gem_obj_to_ggtt(bo);
+	if (WARN_ON(!list_empty(&dev_priv->perf.node_list[id])))
+		INIT_LIST_HEAD(&dev_priv->perf.node_list[id]);
 
 	DRM_DEBUG_DRIVER(
 		"command stream buf initialized, gtt offset = 0x%x, vaddr = %p",
 		 (unsigned int)
-		 dev_priv->perf.command_stream_buf.vma->node.start,
-		 dev_priv->perf.command_stream_buf.addr);
+		 dev_priv->perf.command_stream_buf[id].vma->node.start,
+		 dev_priv->perf.command_stream_buf[id].addr);
 
 	return 0;
 }
@@ -1791,14 +1994,14 @@ static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
 {
 	assert_spin_locked(&dev_priv->perf.hook_lock);
 
-	if (dev_priv->perf.oa.exclusive_stream->state !=
+	if (dev_priv->perf.exclusive_stream->state !=
 					I915_PERF_STREAM_DISABLED) {
 		unsigned long ctx_id = 0;
 
-		if (dev_priv->perf.oa.exclusive_stream->ctx)
+		if (dev_priv->perf.exclusive_stream->ctx)
 			ctx_id = dev_priv->perf.oa.specific_ctx_id;
 
-		if (dev_priv->perf.oa.exclusive_stream->ctx == NULL || ctx_id) {
+		if (dev_priv->perf.exclusive_stream->ctx == NULL || ctx_id) {
 			bool periodic = dev_priv->perf.oa.periodic;
 			u32 period_exponent = dev_priv->perf.oa.period_exponent;
 			u32 report_format = dev_priv->perf.oa.oa_buffer.format;
@@ -1848,14 +2051,15 @@ static void gen8_oa_enable(struct drm_i915_private *dev_priv)
 				   GEN8_OA_COUNTER_ENABLE);
 }
 
-static void i915_oa_stream_enable(struct i915_perf_stream *stream)
+static void i915_ring_stream_enable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	dev_priv->perf.oa.ops.oa_enable(dev_priv);
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		dev_priv->perf.oa.ops.oa_enable(dev_priv);
 
-	if (dev_priv->perf.oa.periodic)
-		hrtimer_start(&dev_priv->perf.oa.poll_check_timer,
+	if (stream->cs_mode || dev_priv->perf.oa.periodic)
+		hrtimer_start(&dev_priv->perf.poll_check_timer,
 			      ns_to_ktime(POLL_PERIOD),
 			      HRTIMER_MODE_REL_PINNED);
 }
@@ -1870,19 +2074,20 @@ static void gen8_oa_disable(struct drm_i915_private *dev_priv)
 	I915_WRITE(GEN8_OACONTROL, 0);
 }
 
-static void i915_oa_stream_disable(struct i915_perf_stream *stream)
+static void i915_ring_stream_disable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	if (dev_priv->perf.oa.periodic)
-		hrtimer_cancel(&dev_priv->perf.oa.poll_check_timer);
+	if (stream->cs_mode || dev_priv->perf.oa.periodic)
+		hrtimer_cancel(&dev_priv->perf.poll_check_timer);
 
 	if (stream->cs_mode) {
-		i915_oa_rcs_wait_gpu(dev_priv);
-		i915_oa_rcs_free_requests(dev_priv);
+		i915_ring_stream_wait_gpu(dev_priv, stream->engine);
+		i915_ring_stream_free_requests(dev_priv, stream->engine);
 	}
 
-	dev_priv->perf.oa.ops.oa_disable(dev_priv);
+	if (stream->sample_flags & SAMPLE_OA_REPORT)
+		dev_priv->perf.oa.ops.oa_disable(dev_priv);
 }
 
 static u64 oa_exponent_to_ns(struct drm_i915_private *dev_priv, int exponent)
@@ -1892,17 +2097,17 @@ static u64 oa_exponent_to_ns(struct drm_i915_private *dev_priv, int exponent)
 }
 
 static const struct i915_perf_stream_ops i915_oa_stream_ops = {
-	.destroy = i915_oa_stream_destroy,
-	.enable = i915_oa_stream_enable,
-	.disable = i915_oa_stream_disable,
-	.can_read_unlocked = i915_oa_can_read_unlocked,
-	.wait_unlocked = i915_oa_wait_unlocked,
-	.poll_wait = i915_oa_poll_wait,
-	.read = i915_oa_read,
-	.command_stream_hook = i915_perf_command_stream_hook_oa,
+	.destroy = i915_ring_stream_destroy,
+	.enable = i915_ring_stream_enable,
+	.disable = i915_ring_stream_disable,
+	.can_read_unlocked = i915_ring_stream_can_read_unlocked,
+	.wait_unlocked = i915_ring_stream_wait_unlocked,
+	.poll_wait = i915_ring_stream_poll_wait,
+	.read = i915_ring_stream_read,
+	.command_stream_hook = i915_ring_stream_cs_hook,
 };
 
-static int i915_oa_stream_init(struct i915_perf_stream *stream,
+static int i915_ring_stream_init(struct i915_perf_stream *stream,
 			       struct drm_i915_perf_open_param *param,
 			       struct perf_open_properties *props)
 {
@@ -1911,15 +2116,16 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 						      SAMPLE_OA_SOURCE_INFO);
 	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
 						      SAMPLE_TAG);
-	bool cs_sample_data = props->sample_flags & SAMPLE_OA_REPORT;
+	bool cs_sample_data = props->sample_flags & (SAMPLE_OA_REPORT |
+							SAMPLE_TS);
 	int ret;
 
 	/* To avoid the complexity of having to accurately filter
 	 * counter reports and marshal to the appropriate client
 	 * we currently only allow exclusive access
 	 */
-	if (dev_priv->perf.oa.exclusive_stream) {
-		DRM_ERROR("OA unit already in use\n");
+	if (dev_priv->perf.exclusive_stream) {
+		DRM_ERROR("Stream already in use\n");
 		return -EBUSY;
 	}
 
@@ -1961,6 +2167,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			return -EINVAL;
 		}
 		stream->engine= RCS;
+		stream->using_oa = true;
 
 		format_size =
 			dev_priv->perf.oa.oa_formats[props->oa_format].size;
@@ -2053,8 +2260,22 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			require_cs_mode = true;
 	}
 
+	if (props->sample_flags & SAMPLE_TS) {
+		stream->sample_flags |= SAMPLE_TS;
+		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
+
+		/*
+		 * NB: it's meaningful to request SAMPLE_TS with just CS
+		 * mode or periodic OA mode sampling but we don't allow
+		 * SAMPLE_TS without either mode
+		 */
+		if (!require_oa_unit)
+			require_cs_mode = true;
+	}
+
 	if (require_cs_mode && !props->cs_mode) {
-		DRM_ERROR("PID or TAG sampling require a ring to be specified");
+		DRM_ERROR(
+			"PID, TAG or TS sampling require a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -2069,11 +2290,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 
 		/*
 		 * The only time we should allow enabling CS mode if it's not
-		 * strictly required, is if SAMPLE_CTX_ID has been requested
-		 * as it's usable with periodic OA or CS sampling.
+		 * strictly required, is if SAMPLE_CTX_ID or SAMPLE_TS has been
+		 * requested, as they're usable with periodic OA or CS sampling.
 		 */
 		if (!require_cs_mode &&
-		    !(props->sample_flags & SAMPLE_CTX_ID)) {
+		    !(props->sample_flags & (SAMPLE_CTX_ID|SAMPLE_TS))) {
 			DRM_ERROR(
 				"Ring given without requesting any CS specific property");
 			ret = -EINVAL;
@@ -2081,6 +2302,7 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 		}
 
 		stream->cs_mode = true;
+		stream->engine = props->engine;
 
 		if (props->sample_flags & SAMPLE_PID) {
 			stream->sample_flags |= SAMPLE_PID;
@@ -2092,14 +2314,14 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += 4;
 		}
 
-		ret = alloc_command_stream_buf(dev_priv);
+		ret = alloc_command_stream_buf(dev_priv, stream->engine);
 		if (ret)
 			goto cs_error;
 	}
 
 	stream->ops = &i915_oa_stream_ops;
 
-	dev_priv->perf.oa.exclusive_stream = stream;
+	dev_priv->perf.exclusive_stream = stream;
 
 	return 0;
 
@@ -2133,8 +2355,8 @@ static void i915_oa_context_pin_notify_locked(struct drm_i915_private *dev_priv,
 	    dev_priv->perf.oa.ops.update_hw_ctx_id_locked == NULL)
 		return;
 
-	if (dev_priv->perf.oa.exclusive_stream &&
-	    dev_priv->perf.oa.exclusive_stream->ctx == context) {
+	if (dev_priv->perf.exclusive_stream &&
+	    dev_priv->perf.exclusive_stream->ctx == context) {
 		struct drm_i915_gem_object *obj =
 			context->legacy_hw_ctx.rcs_state;
 		u32 ctx_id = i915_gem_obj_ggtt_offset(obj);
@@ -2203,8 +2425,8 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req)
 	if (dev_priv->perf.oa.ops.legacy_ctx_switch_unlocked == NULL)
 		return;
 
-	if (dev_priv->perf.oa.exclusive_stream &&
-	    dev_priv->perf.oa.exclusive_stream->state !=
+	if (dev_priv->perf.exclusive_stream &&
+	    dev_priv->perf.exclusive_stream->state !=
 				I915_PERF_STREAM_DISABLED) {
 
 		/* XXX: We don't take a lock here and this may run
@@ -2345,13 +2567,13 @@ static ssize_t i915_perf_read(struct file *file,
 	return ret;
 }
 
-static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
+static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
 {
 	struct i915_perf_stream *stream;
 
 	struct drm_i915_private *dev_priv =
 		container_of(hrtimer, typeof(*dev_priv),
-			     perf.oa.poll_check_timer);
+			     perf.poll_check_timer);
 
 	/* No need to protect the streams list here, since the hrtimer is
 	 * disabled before the stream is removed from list, and currently a
@@ -2360,7 +2582,7 @@ static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
 	 */
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if (stream_have_data__unlocked(stream))
-			wake_up(&dev_priv->perf.oa.poll_wq);
+			wake_up(&dev_priv->perf.poll_wq[stream->engine]);
 	}
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
@@ -2564,7 +2786,7 @@ int i915_perf_open_ioctl_locked(struct drm_device *dev,
 	stream->dev_priv = dev_priv;
 	stream->ctx = specific_ctx;
 
-	ret = i915_oa_stream_init(stream, param, props);
+	ret = i915_ring_stream_init(stream, param, props);
 	if (ret)
 		goto err_alloc;
 
@@ -2709,21 +2931,12 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_ENGINE: {
 				unsigned int user_ring_id =
 					value & I915_EXEC_RING_MASK;
-				enum intel_engine_id engine;
 
 				if (user_ring_id > I915_USER_RINGS)
 					return -EINVAL;
 
-				/* XXX: Currently only RCS is supported.
-				 * Remove this check when support for other
-				 * engines is added
-				 */
-				engine = user_ring_map[user_ring_id];
-				if (engine != RCS)
-					return -EINVAL;
-
 				props->cs_mode = true;
-				props->engine = engine;
+				props->engine = user_ring_map[user_ring_id];
 			}
 			break;
 		case DRM_I915_PERF_PROP_SAMPLE_CTX_ID:
@@ -2735,6 +2948,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_TAG:
 			props->sample_flags |= SAMPLE_TAG;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_TS:
+			props->sample_flags |= SAMPLE_TS;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -2825,6 +3041,7 @@ static struct ctl_table dev_root[] = {
 void i915_perf_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
+	int i;
 
 	if (!(IS_HASWELL(dev) ||
 	      IS_BROADWELL(dev) || IS_CHERRYVIEW(dev) ||
@@ -2836,17 +3053,20 @@ void i915_perf_init(struct drm_device *dev)
 	if (!dev_priv->perf.metrics_kobj)
 		return;
 
-	hrtimer_init(&dev_priv->perf.oa.poll_check_timer,
+	hrtimer_init(&dev_priv->perf.poll_check_timer,
 		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	dev_priv->perf.oa.poll_check_timer.function = oa_poll_check_timer_cb;
-	init_waitqueue_head(&dev_priv->perf.oa.poll_wq);
+	dev_priv->perf.poll_check_timer.function = poll_check_timer_cb;
+
+	for (i = 0; i < I915_NUM_ENGINES; i++) {
+		INIT_LIST_HEAD(&dev_priv->perf.node_list[i]);
+		spin_lock_init(&dev_priv->perf.node_list_lock[i]);
+		init_waitqueue_head(&dev_priv->perf.poll_wq[i]);
+	}
 
 	INIT_LIST_HEAD(&dev_priv->perf.streams);
-	INIT_LIST_HEAD(&dev_priv->perf.node_list);
 	mutex_init(&dev_priv->perf.lock);
 	mutex_init(&dev_priv->perf.streams_lock);
 	spin_lock_init(&dev_priv->perf.hook_lock);
-	spin_lock_init(&dev_priv->perf.node_list_lock);
 
 	if (IS_HASWELL(dev)) {
 		dev_priv->perf.oa.ops.init_oa_buffer = gen7_init_oa_buffer;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 4330c9b..92f9eaa 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -410,6 +410,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   MI_FLUSH_DW_STORE_INDEX	(1<<21)
 #define   MI_INVALIDATE_TLB		(1<<18)
 #define   MI_FLUSH_DW_OP_STOREDW	(1<<14)
+#define   MI_FLUSH_DW_OP_STAMP		(3<<14)
 #define   MI_FLUSH_DW_OP_MASK		(3<<14)
 #define   MI_FLUSH_DW_NOTIFY		(1<<8)
 #define   MI_INVALIDATE_BSD		(1<<7)
@@ -491,6 +492,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   PIPE_CONTROL_TLB_INVALIDATE			(1<<18)
 #define   PIPE_CONTROL_MEDIA_STATE_CLEAR		(1<<16)
 #define   PIPE_CONTROL_QW_WRITE				(1<<14)
+#define   PIPE_CONTROL_TIMESTAMP_WRITE			(3<<14)
 #define   PIPE_CONTROL_POST_SYNC_OP_MASK                (3<<14)
 #define   PIPE_CONTROL_DEPTH_STALL			(1<<13)
 #define   PIPE_CONTROL_WRITE_FLUSH			(1<<12)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index bd0e48d..02d43a9 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1272,6 +1272,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_TAG,
 
+	/**
+	 * The value of this property set to 1 requests inclusion of timestamp
+	 * in the perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_TS,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1340,6 +1346,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 ctx_id; } && DRM_I915_PERF_PROP_SAMPLE_CTX_ID
 	 *     { u32 pid; } && DRM_I915_PERF_PROP_SAMPLE_PID
 	 *     { u32 tag; } && DRM_I915_PERF_PROP_SAMPLE_TAG
+	 *     { u64 timestamp; } && DRM_I915_PERF_PROP_SAMPLE_TS
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 10/15] drm/i915: Extract raw GPU timestamps from OA reports to forward in perf samples
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (8 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 09/15] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 11/15] drm/i915: Support opening multiple concurrent perf streams sourab.gupta
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

The OA reports contain the least significant 32 bits of the GPU timestamp.
This patch retrieves the timestamp field from the OA reports and extends it
to a 64-bit raw GPU timestamp for forwarding in the perf samples.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  1 +
 drivers/gpu/drm/i915/i915_perf.c | 46 ++++++++++++++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_reg.h  |  4 ++++
 3 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 2a31b79..a9a123b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2196,6 +2196,7 @@ struct drm_i915_private {
 			u32 ctx_flexeu0_off;
 			u32 n_pending_periodic_samples;
 			u32 pending_periodic_ts;
+			u64 last_gpu_ts;
 
 			struct i915_oa_ops ops;
 			const struct i915_oa_format *oa_formats;
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 4a6fc5e..65b4af6 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -781,6 +781,24 @@ static int append_sample(struct i915_perf_stream *stream,
 	return 0;
 }
 
+static u64 get_gpu_ts_from_oa_report(struct drm_i915_private *dev_priv,
+					const u8 *report)
+{
+	u32 sample_ts = *(u32 *)(report + 4);
+	u32 delta;
+
+	/*
+	 * NB: We have to assume we're updating last_gpu_ts frequently
+	 * enough that it's never possible to see multiple overflows before
+	 * we compare sample_ts to last_gpu_ts. Since that wrap period is
+	 * quite long (~6 min for an 80ns ts base), this is a safe assumption.
+	 */
+	delta = sample_ts - (u32)dev_priv->perf.oa.last_gpu_ts;
+	dev_priv->perf.oa.last_gpu_ts += delta;
+
+	return dev_priv->perf.oa.last_gpu_ts;
+}
+
 static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 				    struct i915_perf_read_state *read_state,
 				    const u8 *report)
@@ -817,10 +835,9 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_TAG)
 		data.tag = dev_priv->perf.last_tag;
 
-	/* Derive timestamp from OA report, after scaling with the ts base */
-#warning "FIXME: append_oa_buffer_sample: derive the timestamp from OA report"
+	/* Derive timestamp from OA report */
 	if (sample_flags & SAMPLE_TS)
-		data.ts = 0;
+		data.ts = get_gpu_ts_from_oa_report(dev_priv, report);
 
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
@@ -1272,6 +1289,7 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 	enum intel_engine_id id = stream->engine;
 	struct sample_data data = { 0 };
 	u32 sample_flags = stream->sample_flags;
+	u64 gpu_ts = 0;
 	int ret = 0;
 
 	if (sample_flags & SAMPLE_OA_REPORT) {
@@ -1288,6 +1306,9 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 						U32_MAX);
 		if (ret)
 			return ret;
+
+		if (sample_flags & SAMPLE_TS)
+			gpu_ts = get_gpu_ts_from_oa_report(dev_priv, report);
 	}
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO)
@@ -1309,17 +1330,14 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 	}
 
 	if (sample_flags & SAMPLE_TS) {
-		/* For RCS, if OA samples are also being collected, derive the
-		 * timestamp from OA report, after scaling with the TS base.
+		/* If OA sampling is enabled, derive the ts from OA report.
 		 * Else, forward the timestamp collected via command stream.
 		 */
-#warning "FIXME: append_one_cs_sample: derive the timestamp from OA report"
-		if (sample_flags & SAMPLE_OA_REPORT)
-			data.ts = 0;
-		else
-			data.ts = *(u64 *)
+		if (!(sample_flags & SAMPLE_OA_REPORT))
+			gpu_ts = *(u64 *)
 				(dev_priv->perf.command_stream_buf[id].addr +
 					node->ts_offset);
+		data.ts = gpu_ts;
 	}
 
 	return append_sample(stream, read_state, &data);
@@ -2055,9 +2073,15 @@ static void i915_ring_stream_enable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	if (stream->sample_flags & SAMPLE_OA_REPORT)
+	if (stream->sample_flags & SAMPLE_OA_REPORT) {
 		dev_priv->perf.oa.ops.oa_enable(dev_priv);
 
+		if (stream->sample_flags & SAMPLE_TS)
+			dev_priv->perf.oa.last_gpu_ts =
+				I915_READ64_2x32(GT_TIMESTAMP_COUNT,
+					GT_TIMESTAMP_COUNT_UDW);
+	}
+
 	if (stream->cs_mode || dev_priv->perf.oa.periodic)
 		hrtimer_start(&dev_priv->perf.poll_check_timer,
 			      ns_to_ktime(POLL_PERIOD),
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 92f9eaa..be7e008 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -591,6 +591,10 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define PS_DEPTH_COUNT                  _MMIO(0x2350)
 #define PS_DEPTH_COUNT_UDW		_MMIO(0x2350 + 4)
 
+/* Timestamp count register */
+#define GT_TIMESTAMP_COUNT		_MMIO(0x2358)
+#define GT_TIMESTAMP_COUNT_UDW		_MMIO(0x2358 + 4)
+
 /* There are the 4 64-bit counter registers, one for each stream output */
 #define GEN7_SO_NUM_PRIMS_WRITTEN(n)		_MMIO(0x5200 + (n) * 8)
 #define GEN7_SO_NUM_PRIMS_WRITTEN_UDW(n)	_MMIO(0x5200 + (n) * 8 + 4)
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 11/15] drm/i915: Support opening multiple concurrent perf streams
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (9 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 10/15] drm/i915: Extract raw GPU timestamps from OA reports to forward in perf samples sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 12/15] time: Expose current clocksource in use by timekeeping framework sourab.gupta
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for opening multiple concurrent perf streams for
different gpu engines, with the restriction that only a single stream may
be open on a particular gpu engine at a time.
This enables a userspace client to open multiple streams, one per engine,
at any time to capture sample data for multiple gpu engines.
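The exclusivity scheme the patch moves to can be sketched in userspace as follows (a minimal illustration with hypothetical names; the driver indexes `ring_stream[]` by `stream->engine` in the same way): the single exclusive-stream pointer becomes one slot per engine, so a second open is rejected only when it targets an engine that already has a stream.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One stream slot per engine instead of one global exclusive stream. */
enum { RCS, BCS, VCS, VECS, NUM_ENGINES };

struct stream { int engine; };

static struct stream *ring_stream[NUM_ENGINES];

static int stream_open(struct stream *s)
{
	if (ring_stream[s->engine])
		return -EBUSY;	/* this engine already has an open stream */
	ring_stream[s->engine] = s;
	return 0;
}

static void stream_close(struct stream *s)
{
	ring_stream[s->engine] = NULL;
}
```

Opening a second stream on RCS fails with -EBUSY, while a concurrent open on VCS succeeds.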

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 +-
 drivers/gpu/drm/i915/i915_perf.c | 69 ++++++++++++++++++++++------------------
 2 files changed, 39 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a9a123b..9ccac83 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2161,7 +2161,7 @@ struct drm_i915_private {
 		spinlock_t hook_lock;
 
 		struct hrtimer poll_check_timer;
-		struct i915_perf_stream *exclusive_stream;
+		struct i915_perf_stream *ring_stream[I915_NUM_ENGINES];
 		wait_queue_head_t poll_wq[I915_NUM_ENGINES];
 
 		struct {
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 65b4af6..aa3589e 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -944,7 +944,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 		 * an invalid ID. It could be good to annotate these
 		 * reports with a _CTX_SWITCH_AWAY reason later.
 		 */
-		if (!dev_priv->perf.exclusive_stream->ctx ||
+		if (!stream->ctx ||
 		    dev_priv->perf.oa.specific_ctx_id == ctx_id ||
 		    dev_priv->perf.oa.oa_buffer.last_ctx_id == ctx_id) {
 
@@ -955,7 +955,7 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
 			 * the switch-away reports with an invalid
 			 * context id to be recognisable by userspace.
 			 */
-			if (dev_priv->perf.exclusive_stream->ctx &&
+			if (stream->ctx &&
 			    dev_priv->perf.oa.specific_ctx_id != ctx_id)
 				report32[2] = 0x1fffff;
 
@@ -1596,7 +1596,7 @@ static void i915_ring_stream_destroy(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
-	BUG_ON(stream != dev_priv->perf.exclusive_stream);
+	BUG_ON(stream != dev_priv->perf.ring_stream[stream->engine]);
 
 	if (stream->using_oa) {
 		dev_priv->perf.oa.ops.disable_metric_set(dev_priv);
@@ -1610,7 +1610,7 @@ static void i915_ring_stream_destroy(struct i915_perf_stream *stream)
 	if (stream->cs_mode)
 		free_command_stream_buf(dev_priv, stream->engine);
 
-	dev_priv->perf.exclusive_stream = NULL;
+	dev_priv->perf.ring_stream[stream->engine] = NULL;
 }
 
 static void gen7_init_oa_buffer(struct drm_i915_private *dev_priv)
@@ -2012,14 +2012,14 @@ static void gen7_update_oacontrol_locked(struct drm_i915_private *dev_priv)
 {
 	assert_spin_locked(&dev_priv->perf.hook_lock);
 
-	if (dev_priv->perf.exclusive_stream->state !=
+	if (dev_priv->perf.ring_stream[RCS]->state !=
 					I915_PERF_STREAM_DISABLED) {
 		unsigned long ctx_id = 0;
 
-		if (dev_priv->perf.exclusive_stream->ctx)
+		if (dev_priv->perf.ring_stream[RCS]->ctx)
 			ctx_id = dev_priv->perf.oa.specific_ctx_id;
 
-		if (dev_priv->perf.exclusive_stream->ctx == NULL || ctx_id) {
+		if (dev_priv->perf.ring_stream[RCS]->ctx == NULL || ctx_id) {
 			bool periodic = dev_priv->perf.oa.periodic;
 			u32 period_exponent = dev_priv->perf.oa.period_exponent;
 			u32 report_format = dev_priv->perf.oa.oa_buffer.format;
@@ -2144,15 +2144,6 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 							SAMPLE_TS);
 	int ret;
 
-	/* To avoid the complexity of having to accurately filter
-	 * counter reports and marshal to the appropriate client
-	 * we currently only allow exclusive access
-	 */
-	if (dev_priv->perf.exclusive_stream) {
-		DRM_ERROR("Stream already in use\n");
-		return -EBUSY;
-	}
-
 	if ((props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
 		if (IS_HASWELL(dev_priv->dev)) {
 			DRM_ERROR(
@@ -2170,6 +2161,12 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 	if (require_oa_unit) {
 		int format_size;
 
+		/* Only allow exclusive access per stream */
+		if (dev_priv->perf.ring_stream[RCS]) {
+			DRM_ERROR("Stream:0 already in use\n");
+			return -EBUSY;
+		}
+
 		if (!dev_priv->perf.oa.ops.init_oa_buffer) {
 			DRM_ERROR("OA unit not supported\n");
 			return -ENODEV;
@@ -2305,6 +2302,13 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 	}
 
 	if (props->cs_mode) {
+		/* Only allow exclusive access per stream */
+		if (dev_priv->perf.ring_stream[props->engine]) {
+			DRM_ERROR("Stream:%d already in use\n", props->engine);
+			ret = -EBUSY;
+			goto cs_error;
+		}
+
 		if (!cs_sample_data) {
 			DRM_ERROR(
 				"Ring given without requesting any CS data to sample");
@@ -2345,7 +2349,7 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 
 	stream->ops = &i915_oa_stream_ops;
 
-	dev_priv->perf.exclusive_stream = stream;
+	dev_priv->perf.ring_stream[stream->engine] = stream;
 
 	return 0;
 
@@ -2379,8 +2383,8 @@ static void i915_oa_context_pin_notify_locked(struct drm_i915_private *dev_priv,
 	    dev_priv->perf.oa.ops.update_hw_ctx_id_locked == NULL)
 		return;
 
-	if (dev_priv->perf.exclusive_stream &&
-	    dev_priv->perf.exclusive_stream->ctx == context) {
+	if (dev_priv->perf.ring_stream[RCS] &&
+	    dev_priv->perf.ring_stream[RCS]->ctx == context) {
 		struct drm_i915_gem_object *obj =
 			context->legacy_hw_ctx.rcs_state;
 		u32 ctx_id = i915_gem_obj_ggtt_offset(obj);
@@ -2449,8 +2453,8 @@ void i915_oa_legacy_ctx_switch_notify(struct drm_i915_gem_request *req)
 	if (dev_priv->perf.oa.ops.legacy_ctx_switch_unlocked == NULL)
 		return;
 
-	if (dev_priv->perf.exclusive_stream &&
-	    dev_priv->perf.exclusive_stream->state !=
+	if (dev_priv->perf.ring_stream[RCS] &&
+	    dev_priv->perf.ring_stream[RCS]->state !=
 				I915_PERF_STREAM_DISABLED) {
 
 		/* XXX: We don't take a lock here and this may run
@@ -2591,23 +2595,26 @@ static ssize_t i915_perf_read(struct file *file,
 	return ret;
 }
 
-static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
+static void wake_up_perf_streams(void *data, async_cookie_t cookie)
 {
+	struct drm_i915_private *dev_priv = data;
 	struct i915_perf_stream *stream;
 
-	struct drm_i915_private *dev_priv =
-		container_of(hrtimer, typeof(*dev_priv),
-			     perf.poll_check_timer);
-
-	/* No need to protect the streams list here, since the hrtimer is
-	 * disabled before the stream is removed from list, and currently a
-	 * single exclusive_stream is supported.
-	 * XXX: revisit this when multiple concurrent streams are supported.
-	 */
+	mutex_lock(&dev_priv->perf.streams_lock);
 	list_for_each_entry(stream, &dev_priv->perf.streams, link) {
 		if (stream_have_data__unlocked(stream))
 			wake_up(&dev_priv->perf.poll_wq[stream->engine]);
 	}
+	mutex_unlock(&dev_priv->perf.streams_lock);
+}
+
+static enum hrtimer_restart poll_check_timer_cb(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(hrtimer, typeof(*dev_priv),
+			     perf.poll_check_timer);
+
+	async_schedule(wake_up_perf_streams, dev_priv);
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(POLL_PERIOD));
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 12/15] time: Expose current clocksource in use by timekeeping framework
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (10 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 11/15] drm/i915: Support opening multiple concurrent perf streams sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 13/15] time: export clocks_calc_mult_shift sourab.gupta
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx
  Cc: Christopher S . Hall, Daniel Vetter, Sourab Gupta, Deepak S,
	Thomas Gleixner

From: Sourab Gupta <sourab.gupta@intel.com>

For drivers to be able to use the cross timestamp framework, they need
to know the current clocksource used by kernel timekeeping. This is
needed because the callback passed by the driver into
get_device_system_crosststamp(), which synchronously reads the device
time and system counter value, must identify the clocksource used to
read the system counter value (as part of struct system_counterval_t).

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 include/linux/timekeeping.h |  5 +++++
 kernel/time/timekeeping.c   | 12 ++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 96f37be..d5a8cd6 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -320,6 +320,11 @@ extern int get_device_system_crosststamp(
 			struct system_device_crosststamp *xtstamp);
 
 /*
+ * Get current clocksource used by system timekeeping framework
+ */
+struct clocksource *get_current_clocksource(void);
+
+/*
  * Simultaneously snapshot realtime and monotonic raw clocks
  */
 extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 479d25c..e92d466 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1133,6 +1133,18 @@ int get_device_system_crosststamp(int (*get_time_fn)
 EXPORT_SYMBOL_GPL(get_device_system_crosststamp);
 
 /**
+ * get_current_clocksource - Returns the current clocksource in use by tk_core
+ *
+ */
+struct clocksource *get_current_clocksource(void)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+
+	return tk->tkr_mono.clock;
+}
+EXPORT_SYMBOL_GPL(get_current_clocksource);
+
+/**
  * do_gettimeofday - Returns the time of day in a timeval
  * @tv:		pointer to the timeval to be set
  *
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 13/15] time: export clocks_calc_mult_shift
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (11 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 12/15] time: Expose current clocksource in use by timekeeping framework sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 14/15] drm/i915: Mechanism to forward clock monotonic raw time in perf samples sourab.gupta
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx
  Cc: Christopher S . Hall, Daniel Vetter, Sourab Gupta, Deepak S,
	Thomas Gleixner

From: Sourab Gupta <sourab.gupta@intel.com>

Exporting clocks_calc_mult_shift allows drivers to calculate the
mult/shift values for their clocks, given their frequency.
This is particularly useful when such drivers may want to associate
timecounter/cyclecounter abstraction for their clock sources, in order
to use the cross timestamp infrastructure for syncing device time with
system time.
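As a rough illustration of what the exported helper computes, here is a userspace re-derivation (mirroring the algorithm in kernel/time/clocksource.c, but under a local name, not the exported symbol): it picks a fixed-point pair so that `ns = (cycles * mult) >> shift` converts cycles at `from` Hz into `to` units without 64-bit overflow for intervals up to `maxsec` seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of the kernel's clocks_calc_mult_shift() algorithm. */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
			    uint32_t from, uint32_t to, uint32_t maxsec)
{
	uint64_t tmp;
	uint32_t sft, sftacc = 32;

	/* Limit the range so (cycles * mult) cannot overflow 64 bits. */
	tmp = ((uint64_t)maxsec * from) >> 32;
	while (tmp) {
		tmp >>= 1;
		sftacc--;
	}

	/* Pick the largest shift (best accuracy) that fits the range. */
	for (sft = 32; sft > 0; sft--) {
		tmp = (uint64_t)to << sft;
		tmp += from / 2;	/* round to nearest */
		tmp /= from;
		if ((tmp >> sftacc) == 0)
			break;
	}
	*mult = (uint32_t)tmp;
	*shift = sft;
}
```

For the 12.5 MHz command stream timestamp clock with a one-hour conversion range, this yields mult = 80 << 21 and shift = 21, i.e. exactly 80 ns per cycle.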

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 kernel/time/clocksource.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..fef256f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -89,6 +89,7 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
 	*mult = tmp;
 	*shift = sft;
 }
+EXPORT_SYMBOL_GPL(clocks_calc_mult_shift);
 
 /*[Clocksource internal variables]---------
  * curr_clocksource:
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 14/15] drm/i915: Mechanism to forward clock monotonic raw time in perf samples
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (12 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 13/15] time: export clocks_calc_mult_shift sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-02  5:18 ` [PATCH 15/15] drm/i915: Support for capturing MMIO register values sourab.gupta
  2016-06-03 12:08 ` ✗ Ro.CI.BAT: failure for Framework to collect command stream gpu metrics using i915 perf (rev2) Patchwork
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx
  Cc: Christopher S . Hall, Daniel Vetter, Sourab Gupta, Deepak S,
	Thomas Gleixner

From: Sourab Gupta <sourab.gupta@intel.com>

Currently, we have the ability to only forward the GPU timestamps in the
samples (which are generated via OA reports or PIPE_CONTROL commands
inserted in the ring). This limits the ability to correlate these samples
with the system events. If we scale the GPU timestamps according to the
timestamp base/frequency info present in bspec, it is observed that the
timestamps drift really quickly from the system time.

An ability is therefore needed to report timestamps in different clock
domains, such as CLOCK_MONOTONIC (or _MONO_RAW), in the perf samples to
be of more practical use to the userspace. This ability becomes important
when we want to correlate/plot GPU events/samples with other system events
on the same timeline (e.g. vblank events, or timestamps when work was
submitted to kernel, etc.)

This patch proposes a mechanism to achieve this. The correlation between
gpu time and system time is established using the cross timestamp framework.
For this purpose, the timestamp clock associated with the command stream is
abstracted as a timecounter/cyclecounter, before utilizing the cross
timestamp framework to retrieve correlated gpu/system time values.
Different such gpu/system time values are then used to detect and correct
the error in published gpu timestamp clock frequency. The userspace can
request CLOCK_MONOTONIC_RAW timestamps in samples by requesting the
corresponding property while opening the stream.
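The frequency-correction step performed by the resync work can be sketched as plain arithmetic (hypothetical local names; the thresholds of 1 µs and 1 s follow the patch body): given how far the gpu clock advanced versus the monotonic raw clock over the resync interval, scale the error into a frequency delta and apply it with the appropriate sign.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the drift correction in i915_perf_clock_sync_work():
 * err_ns > 0 means the gpu clock ran faster than its published
 * frequency suggests, so the frequency estimate is revised upward. */
static int64_t correct_gpu_freq(uint32_t gpu_freq,
				int64_t gpu_delta_ns, int64_t mono_delta_ns)
{
	int64_t err_ns = gpu_delta_ns - mono_delta_ns;
	int64_t freq_delta;

	/* < 1 us: frequency is accurate enough; > 1 s: clocks are out
	 * of sync (e.g. suspend/resume), so skip correction entirely. */
	if (llabs(err_ns) < 1000 || llabs(err_ns) > 1000000000LL)
		return gpu_freq;

	freq_delta = llabs(err_ns) * gpu_freq / mono_delta_ns;
	return err_ns < 0 ? gpu_freq - freq_delta : gpu_freq + freq_delta;
}
```

For example, a gpu clock that gains 100 µs over a one-second interval at a nominal 12.5 MHz gets its frequency estimate bumped by 1250 Hz.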

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c  |   2 +
 drivers/gpu/drm/i915/i915_drv.h  |  24 +++-
 drivers/gpu/drm/i915/i915_perf.c | 273 +++++++++++++++++++++++++++++++++++----
 include/uapi/drm/i915_drm.h      |   9 +-
 4 files changed, 284 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index ab1f6c4..01f3559 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1327,6 +1327,8 @@ static int i915_driver_init_hw(struct drm_i915_private *dev_priv)
 			DRM_DEBUG_DRIVER("can't enable MSI");
 	}
 
+	i915_perf_init_late(dev_priv);
+
 	return 0;
 
 out_ggtt:
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9ccac83..d99ea73 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -42,6 +42,9 @@
 #include <linux/kref.h>
 #include <linux/pm_qos.h>
 #include <linux/shmem_fs.h>
+#include <linux/timecounter.h>
+#include <linux/clocksource.h>
+#include <linux/timekeeping.h>
 
 #include <drm/drmP.h>
 #include <drm/intel-gtt.h>
@@ -1825,6 +1828,9 @@ struct i915_perf_stream {
 	/* Whether the OA unit is in use */
 	bool using_oa;
 
+	/* monotonic_raw clk timestamp (in ns) for last sample */
+	u64 last_sample_ts;
+
 	const struct i915_perf_stream_ops *ops;
 };
 
@@ -1869,6 +1875,20 @@ struct i915_perf_cs_data_node {
 	u32 tag;
 };
 
+/**
+ * struct i915_clock_info - describes i915 timestamp clock
+ *
+ */
+struct i915_clock_info {
+	struct cyclecounter cc;
+	struct timecounter tc;
+	struct system_device_crosststamp xtstamp;
+	ktime_t clk_offset; /* Offset (in ns) between monoraw clk and gpu time */
+	u32 timestamp_frequency;
+	u32 resync_period; /* in msecs */
+	struct delayed_work clk_sync_work;
+};
+
 struct drm_i915_private {
 	struct drm_device *dev;
 	struct kmem_cache *objects;
@@ -2147,6 +2167,8 @@ struct drm_i915_private {
 
 	struct i915_runtime_pm pm;
 
+	struct i915_clock_info ts_clk_info;
+
 	struct {
 		bool initialized;
 
@@ -2169,7 +2191,6 @@ struct drm_i915_private {
 
 			bool periodic;
 			int period_exponent;
-			int timestamp_frequency;
 
 			int tail_margin;
 
@@ -3699,6 +3720,7 @@ int i915_parse_cmds(struct intel_engine_cs *engine,
 
 /* i915_perf.c */
 extern void i915_perf_init(struct drm_device *dev);
+extern void i915_perf_init_late(struct drm_i915_private *dev_priv);
 extern void i915_perf_fini(struct drm_device *dev);
 
 /* i915_suspend.c */
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index aa3589e..e340cf9f 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -23,6 +23,7 @@
 
 #include <linux/anon_inodes.h>
 #include <linux/sizes.h>
+#include <linux/ktime.h>
 
 #include "i915_drv.h"
 #include "intel_ringbuffer.h"
@@ -62,6 +63,9 @@
 #define POLL_FREQUENCY 200
 #define POLL_PERIOD (NSEC_PER_SEC / POLL_FREQUENCY)
 
+#define MAX_CLK_SYNC_PERIOD (60*MSEC_PER_SEC)
+#define INIT_CLK_SYNC_PERIOD (20) /* in msecs */
+
 static u32 i915_perf_stream_paranoid = true;
 
 /* The maximum exponent the hardware accepts is 63 (essentially it selects one
@@ -88,13 +92,24 @@ static u32 i915_perf_stream_paranoid = true;
 #define TS_ADDR_ALIGN 8
 #define I915_PERF_TS_SAMPLE_SIZE 8
 
+/* Published frequency of GT command stream timestamp clock */
+#define FREQUENCY_12_5_MHZ	(12500000)
+#define FREQUENCY_12_0_MHZ	(12000000)
+#define FREQUENCY_19_2_MHZ	(19200000)
+#define GT_CS_TIMESTAMP_FREQUENCY(dev_priv) (IS_GEN9(dev_priv) ? \
+				(IS_BROXTON(dev_priv) ? \
+				FREQUENCY_19_2_MHZ : \
+				FREQUENCY_12_0_MHZ) : \
+				FREQUENCY_12_5_MHZ)
+
 /* Data common to all samples (periodic OA / CS based OA / Timestamps) */
 struct sample_data {
 	u32 source;
 	u32 ctx_id;
 	u32 pid;
 	u32 tag;
-	u64 ts;
+	u64 gpu_ts;
+	u64 clk_monoraw;
 	const u8 *report;
 };
 
@@ -153,6 +168,7 @@ static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
 #define SAMPLE_PID		(1<<3)
 #define SAMPLE_TAG		(1<<4)
 #define SAMPLE_TS		(1<<5)
+#define SAMPLE_CLK_MONO_RAW	(1<<6)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -171,6 +187,136 @@ struct perf_open_properties {
 	enum intel_engine_id engine;
 };
 
+/**
+ * i915_tstamp_clk_cyclecounter_read - read raw cycle counter
+ * @cc: cyclecounter structure
+ **/
+static cycle_t i915_tstamp_clk_cyclecounter_read(
+				const struct cyclecounter *cc)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(cc, typeof(*dev_priv),
+			     ts_clk_info.cc);
+
+	return I915_READ64_2x32(GT_TIMESTAMP_COUNT,
+					GT_TIMESTAMP_COUNT_UDW);
+}
+
+static void i915_tstamp_clk_calc_update_mult_shift(struct cyclecounter *cc,
+					u32 frequency)
+{
+	clocks_calc_mult_shift(&cc->mult, &cc->shift,
+			frequency, NSEC_PER_SEC, 3600);
+}
+
+static void i915_tstamp_clk_init_base_freq(struct drm_i915_private *dev_priv,
+				struct cyclecounter *cc)
+{
+	cc->read = i915_tstamp_clk_cyclecounter_read;
+	cc->mask = CYCLECOUNTER_MASK(64);
+	i915_tstamp_clk_calc_update_mult_shift(cc,
+			dev_priv->ts_clk_info.timestamp_frequency);
+}
+
+/**
+ * i915_get_syncdevicetime - Callback given to timekeeping code to read
+ *	device time and system clk counter value
+ * @device_time: current device time
+ * @system: system counter value read synchronously with device time
+ * @ctx: context provided by timekeeping code
+ *
+ **/
+static int i915_get_syncdevicetime(ktime_t *device_time,
+					 struct system_counterval_t *system,
+					 void *ctx)
+{
+	struct drm_i915_private *dev_priv = (struct drm_i915_private *)ctx;
+	struct timecounter *tc = &dev_priv->ts_clk_info.tc;
+	struct clocksource *curr_clksource;
+
+	*device_time = ns_to_ktime(timecounter_read(tc));
+
+	curr_clksource = get_current_clocksource();
+	system->cycles = curr_clksource->read(curr_clksource);
+	system->cs = curr_clksource;
+
+	return 0;
+}
+
+static void i915_perf_clock_sync_work(struct work_struct *work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(work, typeof(*dev_priv),
+		ts_clk_info.clk_sync_work.work);
+	struct system_device_crosststamp *xtstamp =
+			&dev_priv->ts_clk_info.xtstamp;
+	ktime_t last_sys_time = xtstamp->sys_monoraw;
+	ktime_t last_gpu_time = xtstamp->device;
+	ktime_t clk_mono_offset, gpu_time_offset;
+	s64 delta;
+	u32 gpu_freq = dev_priv->ts_clk_info.timestamp_frequency;
+	u32 freq_delta = 0;
+
+	get_device_system_crosststamp(i915_get_syncdevicetime, dev_priv,
+					NULL, xtstamp);
+
+	clk_mono_offset = ktime_sub(xtstamp->sys_monoraw, last_sys_time);
+	gpu_time_offset = ktime_sub(xtstamp->device, last_gpu_time);
+
+	/* delta time in ns */
+	delta = ktime_to_ns(ktime_sub(gpu_time_offset, clk_mono_offset));
+
+	/* If time delta < 1 us, we can assume gpu frequency is correct */
+	if (abs(delta) < NSEC_PER_USEC)
+		goto out;
+
+	/* The two clocks shouldn't deviate more than 1 second during the
+	 * resync period. If this is the case (which may happen due to
+	 * suspend/resume), then don't apply frequency correction, and
+	 * fast forward/rewind the clocks to resync immediately
+	 */
+	if (abs(delta) > NSEC_PER_SEC)
+		goto out;
+
+	/* Calculate frequency delta */
+	freq_delta = abs(delta)*gpu_freq;
+	do_div(freq_delta, ktime_to_ns(clk_mono_offset));
+
+	if (freq_delta == 0)
+		goto out;
+
+	if (delta < 0)
+		freq_delta = -freq_delta;
+
+	dev_priv->ts_clk_info.timestamp_frequency += freq_delta;
+	i915_tstamp_clk_calc_update_mult_shift(&dev_priv->ts_clk_info.cc,
+			dev_priv->ts_clk_info.timestamp_frequency);
+
+	/*
+	 * Get updated device/system times based on corrected frequency.
+	 * Note that this may cause jumps in device time depending on whether
+	 * frequency delta is positive or negative.
+	 * NB: Take care that monotonicity of sample timestamps is maintained
+	 * even with these jumps.
+	 */
+	get_device_system_crosststamp(i915_get_syncdevicetime, dev_priv,
+					NULL, xtstamp);
+
+out:
+	dev_priv->ts_clk_info.clk_offset = ktime_sub(xtstamp->sys_monoraw,
+						xtstamp->device);
+
+	/* We can schedule next synchronization at incrementally higher
+	 * durations, so that the accuracy of our calculated frequency
+	 * can improve over time. The max resync period is arbitrarily
+	 * set as one hour.
+	 */
+	dev_priv->ts_clk_info.resync_period *= 2;
+	if (dev_priv->ts_clk_info.resync_period < MAX_CLK_SYNC_PERIOD)
+		schedule_delayed_work(&dev_priv->ts_clk_info.clk_sync_work,
+			msecs_to_jiffies(dev_priv->ts_clk_info.resync_period));
+}
+
 /*
  * Emit the commands to capture metrics, into the command stream. This function
  * can be called concurrently with the stream operations and doesn't require
@@ -245,7 +391,7 @@ static int insert_perf_entry(struct drm_i915_private *dev_priv,
 
 	if (stream->sample_flags & SAMPLE_OA_REPORT)
 		entry_size += dev_priv->perf.oa.oa_buffer.format_size;
-	else if (sample_flags & SAMPLE_TS) {
+	else if (sample_flags & (SAMPLE_TS|SAMPLE_CLK_MONO_RAW)) {
 		/*
 		 * XXX: Since TS data can anyways be derived from OA report, so
 		 * no need to capture it for RCS engine, if capture oa data is
@@ -507,7 +653,7 @@ static void i915_ring_stream_cs_hook(struct i915_perf_stream *stream,
 		ret = i915_ring_stream_capture_oa(req, entry->oa_offset);
 		if (ret)
 			goto err_unref;
-	} else if (sample_flags & SAMPLE_TS) {
+	} else if (sample_flags & (SAMPLE_TS|SAMPLE_CLK_MONO_RAW)) {
 		/*
 		 * XXX: Since TS data can anyways be derived from OA report, so
 		 * no need to capture it for RCS engine, if capture oa data is
@@ -764,7 +910,14 @@ static int append_sample(struct i915_perf_stream *stream,
 	}
 
 	if (sample_flags & SAMPLE_TS) {
-		if (copy_to_user(buf, &data->ts, I915_PERF_TS_SAMPLE_SIZE))
+		if (copy_to_user(buf, &data->gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
+			return -EFAULT;
+		buf += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
+	if (sample_flags & SAMPLE_CLK_MONO_RAW) {
+		if (copy_to_user(buf, &data->clk_monoraw,
+					I915_PERF_TS_SAMPLE_SIZE))
 			return -EFAULT;
 		buf += I915_PERF_TS_SAMPLE_SIZE;
 	}
@@ -781,6 +934,27 @@ static int append_sample(struct i915_perf_stream *stream,
 	return 0;
 }
 
+static u64 get_clk_monoraw_from_gpu_ts(struct i915_perf_stream *stream,
+					u64 gpu_ts)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	struct timecounter *tc = &dev_priv->ts_clk_info.tc;
+	u64 gpu_time, clk_monoraw;
+
+	gpu_time = cyclecounter_cyc2ns(tc->cc, gpu_ts, tc->mask, &tc->frac);
+
+	clk_monoraw = gpu_time + ktime_to_ns(dev_priv->ts_clk_info.clk_offset);
+
+	/* Ensure monotonicity by clamping the system time in case it goes
+	 * backwards.
+	 */
+	if (clk_monoraw < stream->last_sample_ts)
+		clk_monoraw = stream->last_sample_ts;
+
+	stream->last_sample_ts = clk_monoraw;
+	return clk_monoraw;
+}
+
 static u64 get_gpu_ts_from_oa_report(struct drm_i915_private *dev_priv,
 					const u8 *report)
 {
@@ -837,7 +1011,13 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 
 	/* Derive timestamp from OA report */
 	if (sample_flags & SAMPLE_TS)
-		data.ts = get_gpu_ts_from_oa_report(dev_priv, report);
+		data.gpu_ts = get_gpu_ts_from_oa_report(dev_priv, report);
+
+	if (sample_flags & SAMPLE_CLK_MONO_RAW) {
+		u64 gpu_ts = get_gpu_ts_from_oa_report(dev_priv, report);
+
+		data.clk_monoraw = get_clk_monoraw_from_gpu_ts(stream, gpu_ts);
+	}
 
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
@@ -1307,7 +1487,7 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 		if (ret)
 			return ret;
 
-		if (sample_flags & SAMPLE_TS)
+		if (sample_flags & (SAMPLE_TS|SAMPLE_CLK_MONO_RAW))
 			gpu_ts = get_gpu_ts_from_oa_report(dev_priv, report);
 	}
 
@@ -1329,7 +1509,7 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 		dev_priv->perf.last_tag = node->tag;
 	}
 
-	if (sample_flags & SAMPLE_TS) {
+	if (sample_flags & (SAMPLE_TS|SAMPLE_CLK_MONO_RAW)) {
 		/* If OA sampling is enabled, derive the ts from OA report.
 		 * Else, forward the timestamp collected via command stream.
 		 */
@@ -1337,7 +1517,12 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 			gpu_ts = *(u64 *)
 				(dev_priv->perf.command_stream_buf[id].addr +
 					node->ts_offset);
-		data.ts = gpu_ts;
+
+		if (sample_flags & SAMPLE_TS)
+			data.gpu_ts = gpu_ts;
+		if (sample_flags & SAMPLE_CLK_MONO_RAW)
+			data.clk_monoraw = get_clk_monoraw_from_gpu_ts(
+							stream,	gpu_ts);
 	}
 
 	return append_sample(stream, read_state, &data);
@@ -2076,12 +2261,28 @@ static void i915_ring_stream_enable(struct i915_perf_stream *stream)
 	if (stream->sample_flags & SAMPLE_OA_REPORT) {
 		dev_priv->perf.oa.ops.oa_enable(dev_priv);
 
-		if (stream->sample_flags & SAMPLE_TS)
+		if (stream->sample_flags & (SAMPLE_TS|SAMPLE_CLK_MONO_RAW))
 			dev_priv->perf.oa.last_gpu_ts =
 				I915_READ64_2x32(GT_TIMESTAMP_COUNT,
 					GT_TIMESTAMP_COUNT_UDW);
 	}
 
+	if (stream->sample_flags & SAMPLE_CLK_MONO_RAW) {
+		struct system_device_crosststamp *xtstamp =
+			&dev_priv->ts_clk_info.xtstamp;
+
+		get_device_system_crosststamp(i915_get_syncdevicetime, dev_priv,
+					NULL, xtstamp);
+		dev_priv->ts_clk_info.clk_offset = ktime_sub(xtstamp->sys_monoraw,
+						xtstamp->device);
+
+		if (dev_priv->ts_clk_info.resync_period < MAX_CLK_SYNC_PERIOD)
+			schedule_delayed_work(
+				&dev_priv->ts_clk_info.clk_sync_work,
+				msecs_to_jiffies(
+					dev_priv->ts_clk_info.resync_period));
+	}
+
 	if (stream->cs_mode || dev_priv->perf.oa.periodic)
 		hrtimer_start(&dev_priv->perf.poll_check_timer,
 			      ns_to_ktime(POLL_PERIOD),
@@ -2102,6 +2303,8 @@ static void i915_ring_stream_disable(struct i915_perf_stream *stream)
 {
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 
+	cancel_delayed_work_sync(&dev_priv->ts_clk_info.clk_sync_work);
+
 	if (stream->cs_mode || dev_priv->perf.oa.periodic)
 		hrtimer_cancel(&dev_priv->perf.poll_check_timer);
 
@@ -2117,7 +2320,7 @@ static void i915_ring_stream_disable(struct i915_perf_stream *stream)
 static u64 oa_exponent_to_ns(struct drm_i915_private *dev_priv, int exponent)
 {
 	return 1000000000ULL * (2ULL << exponent) /
-		dev_priv->perf.oa.timestamp_frequency;
+		dev_priv->ts_clk_info.timestamp_frequency;
 }
 
 static const struct i915_perf_stream_ops i915_oa_stream_ops = {
@@ -2141,7 +2344,8 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
 						      SAMPLE_TAG);
 	bool cs_sample_data = props->sample_flags & (SAMPLE_OA_REPORT |
-							SAMPLE_TS);
+							SAMPLE_TS |
+							SAMPLE_CLK_MONO_RAW);
 	int ret;
 
 	if ((props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
@@ -2294,6 +2498,19 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 			require_cs_mode = true;
 	}
 
+	if (props->sample_flags & SAMPLE_CLK_MONO_RAW) {
+		stream->sample_flags |= SAMPLE_CLK_MONO_RAW;
+		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
+
+		/*
+		 * NB: it's meaningful to request SAMPLE_CLK_MONO_RAW with just
+		 * CS mode or periodic OA mode sampling but we don't allow
+		 * SAMPLE_CLK_MONO_RAW without either mode
+		 */
+		if (!require_oa_unit)
+			require_cs_mode = true;
+	}
+
 	if (require_cs_mode && !props->cs_mode) {
 		DRM_ERROR(
 			"PID, TAG or TS sampling require a ring to be specified");
@@ -2318,11 +2535,13 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 
 		/*
 		 * The only time we should allow enabling CS mode if it's not
-		 * strictly required, is if SAMPLE_CTX_ID  or SAMPLE_TS has been
-		 * requested, as they're usable with periodic OA or CS sampling.
+		 * strictly required, is if SAMPLE_CTX_ID, SAMPLE_TS or
+		 * SAMPLE_CLK_MONO_RAW has been requested, as they're usable
+		 * with periodic OA or CS sampling.
 		 */
 		if (!require_cs_mode &&
-		    !(props->sample_flags & (SAMPLE_CTX_ID|SAMPLE_TS))) {
+		    !(props->sample_flags &
+				(SAMPLE_CTX_ID|SAMPLE_TS|SAMPLE_CLK_MONO_RAW))) {
 			DRM_ERROR(
 				"Ring given without requesting any CS specific property");
 			ret = -EINVAL;
@@ -2982,6 +3201,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_TS:
 			props->sample_flags |= SAMPLE_TS;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW:
+			props->sample_flags |= SAMPLE_CLK_MONO_RAW;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
@@ -3072,6 +3294,7 @@ static struct ctl_table dev_root[] = {
 void i915_perf_init(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
+	struct i915_clock_info *clk_info = &dev_priv->ts_clk_info;
 	int i;
 
 	if (!(IS_HASWELL(dev) ||
@@ -3084,6 +3307,12 @@ void i915_perf_init(struct drm_device *dev)
 	if (!dev_priv->perf.metrics_kobj)
 		return;
 
+	clk_info->timestamp_frequency = GT_CS_TIMESTAMP_FREQUENCY(dev_priv);
+	clk_info->resync_period = INIT_CLK_SYNC_PERIOD;
+	INIT_DELAYED_WORK(&clk_info->clk_sync_work, i915_perf_clock_sync_work);
+
+	i915_tstamp_clk_init_base_freq(dev_priv, &clk_info->cc);
+
 	hrtimer_init(&dev_priv->perf.poll_check_timer,
 		     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	dev_priv->perf.poll_check_timer.function = poll_check_timer_cb;
@@ -3113,8 +3342,6 @@ void i915_perf_init(struct drm_device *dev)
 		dev_priv->perf.oa.ops.oa_buffer_get_ctx_id =
 					gen7_oa_buffer_get_ctx_id;
 
-		dev_priv->perf.oa.timestamp_frequency = 12500000;
-
 		dev_priv->perf.oa.oa_formats = hsw_oa_formats;
 
 		dev_priv->perf.oa.n_builtin_sets =
@@ -3149,8 +3376,6 @@ void i915_perf_init(struct drm_device *dev)
 			dev_priv->perf.oa.n_builtin_sets =
 				i915_oa_n_builtin_metric_sets_bdw;
 
-			dev_priv->perf.oa.timestamp_frequency = 12500000;
-
 			if (i915_perf_init_sysfs_bdw(dev_priv))
 				goto sysfs_error;
 		} else if (IS_CHERRYVIEW(dev)) {
@@ -3163,8 +3388,6 @@ void i915_perf_init(struct drm_device *dev)
 			dev_priv->perf.oa.n_builtin_sets =
 				i915_oa_n_builtin_metric_sets_chv;
 
-			dev_priv->perf.oa.timestamp_frequency = 12500000;
-
 			if (i915_perf_init_sysfs_chv(dev_priv))
 				goto sysfs_error;
 		} else if (IS_SKYLAKE(dev)) {
@@ -3177,8 +3400,6 @@ void i915_perf_init(struct drm_device *dev)
 			dev_priv->perf.oa.n_builtin_sets =
 				i915_oa_n_builtin_metric_sets_skl;
 
-			dev_priv->perf.oa.timestamp_frequency = 12000000;
-
 			if (i915_perf_init_sysfs_skl(dev_priv))
 				goto sysfs_error;
 		}
@@ -3197,6 +3418,14 @@ sysfs_error:
 	return;
 }
 
+void i915_perf_init_late(struct drm_i915_private *dev_priv)
+{
+	struct i915_clock_info *clk_info = &dev_priv->ts_clk_info;
+
+	timecounter_init(&clk_info->tc,	&clk_info->cc,
+				ktime_to_ns(ktime_get_boottime()));
+}
+
 void i915_perf_fini(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = to_i915(dev);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 02d43a9..c4cae53 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1278,6 +1278,12 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_TS,
 
+	/**
+	 * This property requests inclusion of CLOCK_MONOTONIC_RAW system time
+	 * in the perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1346,7 +1352,8 @@ enum drm_i915_perf_record_type {
 	 *     { u32 ctx_id; } && DRM_I915_PERF_PROP_SAMPLE_CTX_ID
 	 *     { u32 pid; } && DRM_I915_PERF_PROP_SAMPLE_PID
 	 *     { u32 tag; } && DRM_I915_PERF_PROP_SAMPLE_TAG
-	 *     { u64 timestamp; } && DRM_I915_PERF_PROP_SAMPLE_TS
+	 *     { u64 gpu_ts; } && DRM_I915_PERF_PROP_SAMPLE_TS
+	 *     { u64 clk_monoraw; } && DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 15/15] drm/i915: Support for capturing MMIO register values
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (13 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 14/15] drm/i915: Mechanism to forward clock monotonic raw time in perf samples sourab.gupta
@ 2016-06-02  5:18 ` sourab.gupta
  2016-06-03 12:08 ` ✗ Ro.CI.BAT: failure for Framework to collect command stream gpu metrics using i915 perf (rev2) Patchwork
  15 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-06-02  5:18 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Sourab Gupta, Deepak S

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for capturing MMIO register values through the
i915 perf interface.
The userspace can request up to 8 MMIO register values to be dumped.
The addresses of these registers can be passed through the corresponding
property 'value' field while opening the stream.
The commands to dump the values of these MMIO registers are then
inserted into the ring along with other commands.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |   4 +
 drivers/gpu/drm/i915/i915_perf.c | 173 ++++++++++++++++++++++++++++++++++++++-
 include/uapi/drm/i915_drm.h      |  14 ++++
 3 files changed, 188 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d99ea73..bfa52dc 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1865,6 +1865,7 @@ struct i915_perf_cs_data_node {
 	u32 start_offset;
 	u32 oa_offset;
 	u32 ts_offset;
+	u32 mmio_offset;
 
 	/* buffer size corresponding to this entry */
 	u32 size;
@@ -2186,6 +2187,9 @@ struct drm_i915_private {
 		struct i915_perf_stream *ring_stream[I915_NUM_ENGINES];
 		wait_queue_head_t poll_wq[I915_NUM_ENGINES];
 
+		u32 num_mmio;
+		u32 mmio_list[I915_PERF_MMIO_NUM_MAX];
+
 		struct {
 			u32 specific_ctx_id;
 
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index e340cf9f..e661d8d 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -111,6 +111,7 @@ struct sample_data {
 	u64 gpu_ts;
 	u64 clk_monoraw;
 	const u8 *report;
+	const u8 *mmio;
 };
 
 /* for sysctl proc_dointvec_minmax of i915_oa_min_timer_exponent */
@@ -169,6 +170,7 @@ static const enum intel_engine_id user_ring_map[I915_USER_RINGS + 1] = {
 #define SAMPLE_TAG		(1<<4)
 #define SAMPLE_TS		(1<<5)
 #define SAMPLE_CLK_MONO_RAW	(1<<6)
+#define SAMPLE_MMIO		(1<<7)
 
 struct perf_open_properties {
 	u32 sample_flags;
@@ -401,6 +403,9 @@ static int insert_perf_entry(struct drm_i915_private *dev_priv,
 		sample_ts = true;
 	}
 
+	if (sample_flags & SAMPLE_MMIO)
+		entry_size += 4*dev_priv->perf.num_mmio;
+
 	spin_lock(&dev_priv->perf.node_list_lock[id]);
 	if (list_empty(&dev_priv->perf.node_list[id])) {
 		offset = 0;
@@ -478,6 +483,10 @@ out:
 		entry->ts_offset = ALIGN(entry->ts_offset, TS_ADDR_ALIGN);
 		offset = entry->ts_offset + I915_PERF_TS_SAMPLE_SIZE;
 	}
+	if (sample_flags & SAMPLE_MMIO) {
+		entry->mmio_offset = offset;
+		offset = entry->mmio_offset + 4*dev_priv->perf.num_mmio;
+	}
 
 	list_add_tail(&entry->link, &dev_priv->perf.node_list[id]);
 #ifndef CMD_STREAM_BUF_OVERFLOW_ALLOWED
@@ -623,6 +632,66 @@ static int i915_ring_stream_capture_ts(struct drm_i915_gem_request *req,
 	return 0;
 }
 
+static int i915_ring_stream_capture_mmio(struct drm_i915_gem_request *req,
+						u32 offset)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	int num_mmio = dev_priv->perf.num_mmio;
+	u32 mmio_addr, addr = 0;
+	int ret, i;
+
+	ret = intel_ring_begin(req, 4*num_mmio);
+	if (ret)
+		return ret;
+
+	mmio_addr =
+		dev_priv->perf.command_stream_buf[engine->id].vma->node.start +
+			offset;
+
+	if (i915.enable_execlists) {
+		for (i = 0; i < num_mmio; i++) {
+			uint32_t cmd;
+
+			addr = mmio_addr + 4*i;
+
+			cmd = MI_STORE_REGISTER_MEM_GEN8 |
+						MI_SRM_LRM_GLOBAL_GTT;
+
+			intel_logical_ring_emit(ringbuf, cmd);
+			intel_logical_ring_emit(ringbuf,
+					dev_priv->perf.mmio_list[i]);
+			intel_logical_ring_emit(ringbuf, addr);
+			intel_logical_ring_emit(ringbuf, 0);
+		}
+		intel_logical_ring_advance(ringbuf);
+	} else {
+		for (i = 0; i < num_mmio; i++) {
+			uint32_t cmd;
+
+			addr = mmio_addr + 4*i;
+
+			if (INTEL_INFO(engine->dev)->gen >= 8)
+				cmd = MI_STORE_REGISTER_MEM_GEN8 |
+					MI_SRM_LRM_GLOBAL_GTT;
+			else
+				cmd = MI_STORE_REGISTER_MEM |
+					MI_SRM_LRM_GLOBAL_GTT;
+
+			intel_ring_emit(engine, cmd);
+			intel_ring_emit(engine, dev_priv->perf.mmio_list[i]);
+			intel_ring_emit(engine, addr);
+			if (INTEL_INFO(engine->dev)->gen >= 8)
+				intel_ring_emit(engine, 0);
+			else
+				intel_ring_emit(engine, MI_NOOP);
+		}
+		intel_ring_advance(engine);
+	}
+	return 0;
+}
+
 static void i915_ring_stream_cs_hook(struct i915_perf_stream *stream,
 				struct drm_i915_gem_request *req, u32 tag)
 {
@@ -664,6 +733,13 @@ static void i915_ring_stream_cs_hook(struct i915_perf_stream *stream,
 			goto err_unref;
 	}
 
+	if (sample_flags & SAMPLE_MMIO) {
+		ret = i915_ring_stream_capture_mmio(req,
+				entry->mmio_offset);
+		if (ret)
+			goto err_unref;
+	}
+
 	i915_vma_move_to_active(dev_priv->perf.command_stream_buf[id].vma, req);
 	return;
 
@@ -922,6 +998,12 @@ static int append_sample(struct i915_perf_stream *stream,
 		buf += I915_PERF_TS_SAMPLE_SIZE;
 	}
 
+	if (sample_flags & SAMPLE_MMIO) {
+		if (copy_to_user(buf, data->mmio, 4*dev_priv->perf.num_mmio))
+			return -EFAULT;
+		buf += 4*dev_priv->perf.num_mmio;
+	}
+
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, data->report, report_size))
 			return -EFAULT;
@@ -980,6 +1062,7 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 	struct drm_i915_private *dev_priv = stream->dev_priv;
 	u32 sample_flags = stream->sample_flags;
 	struct sample_data data = { 0 };
+	u32 mmio_list_dummy[I915_PERF_MMIO_NUM_MAX] = { 0 };
 
 	if (sample_flags & SAMPLE_OA_SOURCE_INFO) {
 		enum drm_i915_perf_oa_event_source source;
@@ -1019,6 +1102,10 @@ static int append_oa_buffer_sample(struct i915_perf_stream *stream,
 		data.clk_monoraw = get_clk_monoraw_from_gpu_ts(stream, gpu_ts);
 	}
 
+	/* Periodic OA samples don't have mmio associated with them */
+	if (sample_flags & SAMPLE_MMIO)
+		data.mmio = (u8 *)mmio_list_dummy;
+
 	if (sample_flags & SAMPLE_OA_REPORT)
 		data.report = report;
 
@@ -1525,6 +1612,10 @@ static int append_one_cs_sample(struct i915_perf_stream *stream,
 							stream,	gpu_ts);
 	}
 
+	if (sample_flags & SAMPLE_MMIO)
+		data.mmio = dev_priv->perf.command_stream_buf[id].addr +
+				node->mmio_offset;
+
 	return append_sample(stream, read_state, &data);
 }
 
@@ -2342,10 +2433,12 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 	bool require_oa_unit = props->sample_flags & (SAMPLE_OA_REPORT |
 						      SAMPLE_OA_SOURCE_INFO);
 	bool require_cs_mode = props->sample_flags & (SAMPLE_PID |
-						      SAMPLE_TAG);
+						      SAMPLE_TAG |
+						      SAMPLE_MMIO);
 	bool cs_sample_data = props->sample_flags & (SAMPLE_OA_REPORT |
 							SAMPLE_TS |
-							SAMPLE_CLK_MONO_RAW);
+							SAMPLE_CLK_MONO_RAW |
+							SAMPLE_MMIO);
 	int ret;
 
 	if ((props->sample_flags & SAMPLE_CTX_ID) && !props->cs_mode) {
@@ -2513,7 +2606,7 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 
 	if (require_cs_mode && !props->cs_mode) {
 		DRM_ERROR(
-			"PID, TAG or TS sampling require a ring to be specified");
+			"PID, TAG, TS or MMIO sampling require a ring to be specified");
 		ret = -EINVAL;
 		goto cs_error;
 	}
@@ -2561,6 +2654,11 @@ static int i915_ring_stream_init(struct i915_perf_stream *stream,
 			stream->sample_size += 4;
 		}
 
+		if (props->sample_flags & SAMPLE_MMIO) {
+			stream->sample_flags |= SAMPLE_MMIO;
+			stream->sample_size += 4 * dev_priv->perf.num_mmio;
+		}
+
 		ret = alloc_command_stream_buf(dev_priv, stream->engine);
 		if (ret)
 			goto cs_error;
@@ -3084,6 +3182,69 @@ err:
 	return ret;
 }
 
+static int check_mmio_whitelist(struct drm_i915_private *dev_priv, u32 num_mmio)
+{
+#define GEN_RANGE(l, h) GENMASK(h, l)
+	static const struct register_whitelist {
+		i915_reg_t mmio;
+		uint32_t size;
+		/* supported gens, 0x10 for 4, 0x30 for 4 and 5, etc. */
+		uint32_t gen_bitmask;
+	} whitelist[] = {
+		{ GEN6_GT_GFX_RC6, 4, GEN_RANGE(7, 9) },
+		{ GEN6_GT_GFX_RC6p, 4, GEN_RANGE(7, 9) },
+	};
+	int i, count;
+
+	for (count = 0; count < num_mmio; count++) {
+		/* Coarse check on mmio reg addresses being non zero */
+		if (!dev_priv->perf.mmio_list[count])
+			return -EINVAL;
+
+		for (i = 0; i < ARRAY_SIZE(whitelist); i++) {
+			if ((i915_mmio_reg_offset(whitelist[i].mmio) ==
+				dev_priv->perf.mmio_list[count]) &&
+			    (1 << INTEL_INFO(dev_priv->dev)->gen &
+					whitelist[i].gen_bitmask))
+				break;
+		}
+
+		if (i == ARRAY_SIZE(whitelist))
+			return -EINVAL;
+	}
+	return 0;
+}
+
+static int copy_mmio_list(struct drm_i915_private *dev_priv,
+				void __user *mmio)
+{
+	void __user *mmio_list = ((u8 __user *)mmio + 4);
+	u32 num_mmio;
+	int ret;
+
+	if (!mmio)
+		return -EINVAL;
+
+	ret = get_user(num_mmio, (u32 __user *)mmio);
+	if (ret)
+		return ret;
+
+	if (num_mmio > I915_PERF_MMIO_NUM_MAX)
+		return -EINVAL;
+
+	memset(dev_priv->perf.mmio_list, 0, sizeof(dev_priv->perf.mmio_list));
+	if (copy_from_user(dev_priv->perf.mmio_list, mmio_list, 4*num_mmio))
+		return -EINVAL;
+
+	ret = check_mmio_whitelist(dev_priv, num_mmio);
+	if (ret)
+		return ret;
+
+	dev_priv->perf.num_mmio = num_mmio;
+
+	return 0;
+}
+
 /* Note we copy the properties from userspace outside of the i915 perf
  * mutex to avoid an awkward lockdep with mmap_sem.
  *
@@ -3204,6 +3365,12 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW:
 			props->sample_flags |= SAMPLE_CLK_MONO_RAW;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_MMIO:
+			ret = copy_mmio_list(dev_priv, (u64 __user *)value);
+			if (ret)
+				return ret;
+			props->sample_flags |= SAMPLE_MMIO;
+			break;
 		case DRM_I915_PERF_PROP_MAX:
 			BUG();
 		}
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index c4cae53..7fd048c 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1206,6 +1206,12 @@ enum drm_i915_perf_oa_event_source {
 	I915_PERF_OA_EVENT_SOURCE_MAX	/* non-ABI */
 };
 
+#define I915_PERF_MMIO_NUM_MAX	8
+struct drm_i915_perf_mmio_list {
+	__u32 num_mmio;
+	__u32 mmio_list[I915_PERF_MMIO_NUM_MAX];
+};
+
 enum drm_i915_perf_property_id {
 	/**
 	 * Open the stream for a specific context handle (as used with
@@ -1284,6 +1290,13 @@ enum drm_i915_perf_property_id {
 	 */
 	DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW,
 
+	/**
+	 * This property requests inclusion of mmio register values in the perf
+	 * sample data. The value of this property specifies the address of a
+	 * user struct containing the register addresses.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_MMIO,
+
 	DRM_I915_PERF_PROP_MAX /* non-ABI */
 };
 
@@ -1354,6 +1367,7 @@ enum drm_i915_perf_record_type {
 	 *     { u32 tag; } && DRM_I915_PERF_PROP_SAMPLE_TAG
 	 *     { u64 gpu_ts; } && DRM_I915_PERF_PROP_SAMPLE_TS
 	 *     { u64 clk_monoraw; } && DRM_I915_PERF_PROP_SAMPLE_CLOCK_MONOTONIC_RAW
+	 *     { u32 mmio[]; } && DRM_I915_PERF_PROP_SAMPLE_MMIO
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 * };
 	 */
-- 
1.9.1



* Re: [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports
  2016-06-02  5:18 ` [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
@ 2016-06-02  6:00   ` Martin Peres
  2016-06-02  6:28     ` sourab gupta
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Peres @ 2016-06-02  6:00 UTC (permalink / raw)
  To: sourab.gupta, intel-gfx; +Cc: Daniel Vetter, Deepak S

On 02/06/16 08:18, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> This patch introduces a framework to enable OA counter reports associated
> with Render command stream. We can then associate the reports captured
> through this mechanism with their corresponding context id's. This can be
> further extended to associate any other metadata information with the
> corresponding samples (since the association with Render command stream
> gives us the ability to capture this information while inserting the
> corresponding capture commands into the command stream).
>
> The OA reports generated in this way are associated with a corresponding
> workload, and thus can be used to delimit the workload (i.e. sample the
> counters at the workload boundaries), within an ongoing stream of periodic
> counter snapshots.
>
> There may be usecases wherein we need more than periodic OA capture mode
> which is supported currently. This mode is primarily used for two usecases:
>     - Ability to capture system-wide metrics, along with the ability to map
>       the reports back to individual contexts (particularly for HSW).
>     - Ability to inject tags for work into the reports. This provides
>       visibility into the multiple stages of work within a single context.
>
> The userspace will be able to distinguish between the periodic and CS based
> OA reports by virtue of the source_info sample field.
>
> The command MI_REPORT_PERF_COUNT can be used to capture snapshots of OA
> counters, and is inserted at BB boundaries.

So, is it possible to trigger a read of a set of counters (all?) from
the push buffer?

If so, I like this because I was wondering how this work would break
when we move to the GuC-submission model.

Thanks for working on this!
Martin


* Re: [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports
  2016-06-02  6:00   ` Martin Peres
@ 2016-06-02  6:28     ` sourab gupta
  0 siblings, 0 replies; 24+ messages in thread
From: sourab gupta @ 2016-06-02  6:28 UTC (permalink / raw)
  To: Martin Peres; +Cc: Daniel Vetter, intel-gfx, Deepak S

On Thu, 2016-06-02 at 11:30 +0530, Martin Peres wrote:
> On 02/06/16 08:18, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> >
> > This patch introduces a framework to enable OA counter reports associated
> > with Render command stream. We can then associate the reports captured
> > through this mechanism with their corresponding context id's. This can be
> > further extended to associate any other metadata information with the
> > corresponding samples (since the association with Render command stream
> > gives us the ability to capture this information while inserting the
> > corresponding capture commands into the command stream).
> >
> > The OA reports generated in this way are associated with a corresponding
> > workload, and thus can be used to delimit the workload (i.e. sample the
> > counters at the workload boundaries), within an ongoing stream of periodic
> > counter snapshots.
> >
> > There may be usecases wherein we need more than periodic OA capture mode
> > which is supported currently. This mode is primarily used for two usecases:
> >     - Ability to capture system-wide metrics, along with the ability to map
> >       the reports back to individual contexts (particularly for HSW).
> >     - Ability to inject tags for work into the reports. This provides
> >       visibility into the multiple stages of work within a single context.
> >
> > The userspace will be able to distinguish between the periodic and CS based
> > OA reports by virtue of the source_info sample field.
> >
> > The command MI_REPORT_PERF_COUNT can be used to capture snapshots of OA
> > counters, and is inserted at BB boundaries.
> 
> So, it is possible to trigger a read of a set of counters (all?) from 
> the pushbuffer buffer?
> 
> If so, I like this because I was wondering how would this work break 
> when we move to the GuC-submission model.
> 
> Thanks for working on this!
> Martin

Yeah, we can trigger the capture of a counter snapshot using the
MI_REPORT_PERF_COUNT command inserted in the ringbuffer. The capture is
initiated on encountering this command, at an address specified in the
command arguments. The exact details of the counters captured (report
format etc.) depend on the OA unit configuration done earlier.

-Sourab




* ✗ Ro.CI.BAT: failure for Framework to collect command stream gpu metrics using i915 perf (rev2)
  2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
                   ` (14 preceding siblings ...)
  2016-06-02  5:18 ` [PATCH 15/15] drm/i915: Support for capturing MMIO register values sourab.gupta
@ 2016-06-03 12:08 ` Patchwork
  15 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2016-06-03 12:08 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx

== Series Details ==

Series: Framework to collect command stream gpu metrics using i915 perf (rev2)
URL   : https://patchwork.freedesktop.org/series/6154/
State : failure

== Summary ==

Applying: drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
Applying: drm/i915: Expose OA sample source to userspace
fatal: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_perf.c).
error: could not build fake ancestor
Patch failed at 0002 drm/i915: Expose OA sample source to userspace
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



* Re: [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-06-02  5:18 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
@ 2016-07-27  9:18   ` Deepak
  2016-07-27 10:19     ` Daniel Vetter
  0 siblings, 1 reply; 24+ messages in thread
From: Deepak @ 2016-07-27  9:18 UTC (permalink / raw)
  To: intel-gfx




On 06/02/2016 10:48 AM, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> This patch adds a new ctx getparam ioctl parameter, which can be used by
> userspace to retrieve the ctx unique id.
>
> This can be used by userspace to map the i915 perf samples to their
> particular ctx's, since those samples carry the ctx unique id's.
> Otherwise the userspace has no way of maintaining this association,
> since it only knows the per-drm-file-specific ctx handles.
>
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
>   include/uapi/drm/i915_drm.h             | 1 +
>   2 files changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index e974451..09f5178 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1001,6 +1001,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   		else
>   			args->value = to_i915(dev)->ggtt.base.total;
>   		break;
> +	case I915_CONTEXT_PARAM_HW_ID:
> +		args->value = ctx->hw_id;
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 4a1bcfd8..0badc16 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1171,6 +1171,7 @@ struct drm_i915_gem_context_param {
>   #define I915_CONTEXT_PARAM_BAN_PERIOD	0x1
>   #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
>   #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
> +#define I915_CONTEXT_PARAM_HW_ID	0x4
>   	__u64 value;
>   };
>   
Patch looks good to me

Reviewed-by: Deepak S <deepak.s@linux.intel.com>




* Re: [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-07-27  9:18   ` Deepak
@ 2016-07-27 10:19     ` Daniel Vetter
  2016-07-27 10:50       ` Chris Wilson
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Vetter @ 2016-07-27 10:19 UTC (permalink / raw)
  To: Deepak; +Cc: intel-gfx

On Wed, Jul 27, 2016 at 02:48:38PM +0530, Deepak wrote:
> 
> 
> 
> On 06/02/2016 10:48 AM, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > This patch adds a new ctx getparam ioctl parameter, which can be used by
> > userspace to retrieve the ctx unique id.
> >
> > This can be used by userspace to map the i915 perf samples to their
> > particular ctx's, since those samples carry the ctx unique id's.
> > Otherwise the userspace has no way of maintaining this association,
> > since it only knows the per-drm-file-specific ctx handles.
> > 
> > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
> >   include/uapi/drm/i915_drm.h             | 1 +
> >   2 files changed, 4 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > index e974451..09f5178 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -1001,6 +1001,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> >   		else
> >   			args->value = to_i915(dev)->ggtt.base.total;
> >   		break;
> > +	case I915_CONTEXT_PARAM_HW_ID:
> > +		args->value = ctx->hw_id;
> > +		break;
> >   	default:
> >   		ret = -EINVAL;
> >   		break;
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index 4a1bcfd8..0badc16 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1171,6 +1171,7 @@ struct drm_i915_gem_context_param {
> >   #define I915_CONTEXT_PARAM_BAN_PERIOD	0x1
> >   #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
> >   #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
> > +#define I915_CONTEXT_PARAM_HW_ID	0x4
> >   	__u64 value;
> >   };
> Patch looks good to me
> 
> Reviewed-by: Deepak S <deepak.s@linux.intel.com>

ctx->hw_id gets recycled without any involvement from userspace. How
exactly is this supposed to work?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-07-27 10:19     ` Daniel Vetter
@ 2016-07-27 10:50       ` Chris Wilson
  2016-07-28  9:37         ` Daniel Vetter
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Wilson @ 2016-07-27 10:50 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, Jul 27, 2016 at 12:19:00PM +0200, Daniel Vetter wrote:
> On Wed, Jul 27, 2016 at 02:48:38PM +0530, Deepak wrote:
> > 
> > 
> > 
> > On 06/02/2016 10:48 AM, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > 
> > > This patch adds a new ctx getparam ioctl parameter, which can be used by
> > > userspace to retrieve the ctx unique id.
> > >
> > > This can be used by userspace to map the i915 perf samples to their
> > > particular ctx's, since those samples carry the ctx unique id's.
> > > Otherwise the userspace has no way of maintaining this association,
> > > since it only knows the per-drm-file-specific ctx handles.
> > > 
> > > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
> > >   include/uapi/drm/i915_drm.h             | 1 +
> > >   2 files changed, 4 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > index e974451..09f5178 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > @@ -1001,6 +1001,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> > >   		else
> > >   			args->value = to_i915(dev)->ggtt.base.total;
> > >   		break;
> > > +	case I915_CONTEXT_PARAM_HW_ID:
> > > +		args->value = ctx->hw_id;
> > > +		break;
> > >   	default:
> > >   		ret = -EINVAL;
> > >   		break;
> > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > index 4a1bcfd8..0badc16 100644
> > > --- a/include/uapi/drm/i915_drm.h
> > > +++ b/include/uapi/drm/i915_drm.h
> > > @@ -1171,6 +1171,7 @@ struct drm_i915_gem_context_param {
> > >   #define I915_CONTEXT_PARAM_BAN_PERIOD	0x1
> > >   #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
> > >   #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
> > > +#define I915_CONTEXT_PARAM_HW_ID	0x4
> > >   	__u64 value;
> > >   };
> > Patch looks good to me
> > 
> > Reviewed-by: Deepak S <deepak.s@linux.intel.com>
> 
> ctx->hw_id gets recycled without any involvement from userspace. How
> exactly is this supposed to work?

ctx->hw_id is invariant for the lifetime of the context (and so we limit
the number of live contexts to meet hw constraints, ~1 million).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-07-27 10:50       ` Chris Wilson
@ 2016-07-28  9:37         ` Daniel Vetter
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel Vetter @ 2016-07-28  9:37 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Deepak, intel-gfx

On Wed, Jul 27, 2016 at 11:50:36AM +0100, Chris Wilson wrote:
> On Wed, Jul 27, 2016 at 12:19:00PM +0200, Daniel Vetter wrote:
> > On Wed, Jul 27, 2016 at 02:48:38PM +0530, Deepak wrote:
> > > 
> > > 
> > > 
> > > On 06/02/2016 10:48 AM, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > This patch adds a new ctx getparam ioctl parameter, which can be used to
> > > > retrieve ctx unique id by userspace.
> > > > 
> > > > This can be used by userspace to map the i915 perf samples with their
> > > > particular ctx's, since those would be having ctx unique id's.
> > > > Otherwise the userspace has no way of maintaining this association,
> > > > since it has the knowledge of only per-drm file specific ctx handles.
> > > > 
> > > > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > > > ---
> > > >   drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
> > > >   include/uapi/drm/i915_drm.h             | 1 +
> > > >   2 files changed, 4 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > index e974451..09f5178 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > @@ -1001,6 +1001,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
> > > >   		else
> > > >   			args->value = to_i915(dev)->ggtt.base.total;
> > > >   		break;
> > > > +	case I915_CONTEXT_PARAM_HW_ID:
> > > > +		args->value = ctx->hw_id;
> > > > +		break;
> > > >   	default:
> > > >   		ret = -EINVAL;
> > > >   		break;
> > > > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > > > index 4a1bcfd8..0badc16 100644
> > > > --- a/include/uapi/drm/i915_drm.h
> > > > +++ b/include/uapi/drm/i915_drm.h
> > > > @@ -1171,6 +1171,7 @@ struct drm_i915_gem_context_param {
> > > >   #define I915_CONTEXT_PARAM_BAN_PERIOD	0x1
> > > >   #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
> > > >   #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
> > > > +#define I915_CONTEXT_PARAM_HW_ID	0x4
> > > >   	__u64 value;
> > > >   };
> > > Patch looks good to me
> > > 
> > > Reviewed-by: Deepak S <deepak.s@linux.intel.com>
> > 
> > ctx->hw_id gets recycled without any involvement from userspace. How
> > exactly is this supposed to work?
> 
> ctx->hw_id is invariant for the lifetime of the context (and so we limit
> the number of live contexts to meet hw constraints, ~1 million).

Argh, I was tricked by the comment in assign_hw_id - I thought it was
talking about the hw_id only, not about the entire ctx.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id
  2016-11-04  9:30 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
@ 2016-11-04  9:30 ` sourab.gupta
  0 siblings, 0 replies; 24+ messages in thread
From: sourab.gupta @ 2016-11-04  9:30 UTC (permalink / raw)
  To: intel-gfx
  Cc: Christopher S . Hall, Daniel Vetter, Sourab Gupta, Matthew Auld,
	Thomas Gleixner

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds a new ctx getparam ioctl parameter that userspace can use
to retrieve a context's unique id.

Userspace can use this to map i915 perf samples to the contexts they
belong to, since the samples are tagged with these unique ids. Without
it, userspace has no way of maintaining this association, as it only
knows the per-drm-file context handles.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
 include/uapi/drm/i915_drm.h             | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e6616ed..d0efa5e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1078,6 +1078,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_NO_ERROR_CAPTURE:
 		args->value = !!(ctx->flags & CONTEXT_NO_ERROR_CAPTURE);
 		break;
+	case I915_CONTEXT_PARAM_HW_ID:
+		args->value = ctx->hw_id;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f63a392..e95f666 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1223,6 +1223,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_NO_ZEROMAP	0x2
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
+#define I915_CONTEXT_PARAM_HW_ID	0x5
 	__u64 value;
 };
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-11-04  9:29 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-02  5:18 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
2016-06-02  5:18 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
2016-07-27  9:18   ` Deepak
2016-07-27 10:19     ` Daniel Vetter
2016-07-27 10:50       ` Chris Wilson
2016-07-28  9:37         ` Daniel Vetter
2016-06-02  5:18 ` [PATCH 02/15] drm/i915: Expose OA sample source to userspace sourab.gupta
2016-06-02  5:18 ` [PATCH 03/15] drm/i915: Framework for capturing command stream based OA reports sourab.gupta
2016-06-02  6:00   ` Martin Peres
2016-06-02  6:28     ` sourab gupta
2016-06-02  5:18 ` [PATCH 04/15] drm/i915: flush periodic samples, in case of no pending CS sample requests sourab.gupta
2016-06-02  5:18 ` [PATCH 05/15] drm/i915: Handle the overflow condition for command stream buf sourab.gupta
2016-06-02  5:18 ` [PATCH 06/15] drm/i915: Populate ctx ID for periodic OA reports sourab.gupta
2016-06-02  5:18 ` [PATCH 07/15] drm/i915: Add support for having pid output with OA report sourab.gupta
2016-06-02  5:18 ` [PATCH 08/15] drm/i915: Add support for emitting execbuffer tags through OA counter reports sourab.gupta
2016-06-02  5:18 ` [PATCH 09/15] drm/i915: Extend i915 perf framework for collecting timestamps on all gpu engines sourab.gupta
2016-06-02  5:18 ` [PATCH 10/15] drm/i915: Extract raw GPU timestamps from OA reports to forward in perf samples sourab.gupta
2016-06-02  5:18 ` [PATCH 11/15] drm/i915: Support opening multiple concurrent perf streams sourab.gupta
2016-06-02  5:18 ` [PATCH 12/15] time: Expose current clocksource in use by timekeeping framework sourab.gupta
2016-06-02  5:18 ` [PATCH 13/15] time: export clocks_calc_mult_shift sourab.gupta
2016-06-02  5:18 ` [PATCH 14/15] drm/i915: Mechanism to forward clock monotonic raw time in perf samples sourab.gupta
2016-06-02  5:18 ` [PATCH 15/15] drm/i915: Support for capturing MMIO register values sourab.gupta
2016-06-03 12:08 ` ✗ Ro.CI.BAT: failure for Framework to collect command stream gpu metrics using i915 perf (rev2) Patchwork
2016-11-04  9:30 [PATCH 00/15] Framework to collect command stream gpu metrics using i915 perf sourab.gupta
2016-11-04  9:30 ` [PATCH 01/15] drm/i915: Add ctx getparam ioctl parameter to retrieve ctx unique id sourab.gupta
