* [RFC 0/8] Introduce framework for forwarding generic non-OA performance
@ 2015-08-05  5:55 sourab.gupta
  2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This is an updated patch set (v3 - changes list at end), which builds upon the
multi-context OA patch set introduced earlier at:

http://lists.freedesktop.org/archives/intel-gfx/2015-August/072949.html

The OA unit is specific to the render ring and can't cater to the performance
data requirements of other GPU engines.
In particular, media workloads may utilize other GPU engines, but there is
currently no framework which can be used to query performance statistics for
non-RCS workloads and provide this data to userspace tools. This patch set
tries to address this specific problem. The aim of this patch series is to
build upon the perf event framework developed earlier and use it for
forwarding performance data of non-RCS engine workloads.

Since the previous PMU is customized to handle OA reports, a new perf PMU is
added to handle generic non-OA performance data. Examples of such non-OA
performance data are timestamps and MMIO register values.
This patch set enables a framework for capturing timestamps at
batch buffer boundaries, by inserting the corresponding commands into the
ringbuffer, and forwarding the samples to userspace through the perf interface.
Nevertheless, the framework and data structures can be extended to introduce
more performance data types (other than timestamps). The intention here is to
introduce a framework that enables capturing generic performance data and
forwarding it to userspace using the perf APIs.

The reports generated will again have an additional footer for metadata
information such as ctx_id, pid, ring id and tags (in the same way as done
for OA reports specified in the patch series earlier). This information can be
used by userspace tools such as MVP (Modular Video Profiler) to associate
reports with individual contexts and different stages of workload execution.

In this patch set, the timestamps are captured at batch buffer boundaries by
inserting commands into the ringbuffer at those boundaries. As noted
earlier, for a system-wide GPU profiler, the relative complexity of doing this
in the kernel is significantly less than supporting this use case through
userspace command insertion by all the different components.

The final patch in the series extends the data structures to enable
capture of up to 8 MMIO register values, in conjunction with timestamps.

v2: This patch series has the following changes relative to the one floated
    earlier:
    - Removed synchronous waits during event stop/destroy
    - Segregated the book-keeping data for the samples from the destination
      buffer, collecting it into a separate list
    - Managed the lifetime of the destination buffer with the help of gem
      active reference tracking
    - Limited the scope of the i915 device mutex to places of gem interaction,
      with the PMU data structures protected by a per-PMU lock
    - Userspace can now control the metadata it wants by requesting it during
      event init. The sample is sent with the requested metadata in a
      packed format.
    - Merged some patches together and introduced a few more
    - Put the MMIO whitelist in place

v3: Changes made:
    - Meeting semantics for flush (ensuring to flush samples before returning).
    - spin_lock used in place of spin_lock_irqsave.
    - Using BUILD_BUG_ON macros to test the alignment/size requirements.
    - Some code restructuring/optimization, better nomenclature, and error
      handling.

Sourab Gupta (8):
  drm/i915: Add a new PMU for handling non-OA counter data profiling
    requests
  drm/i915: Add mechanism for forwarding the timestamp data through perf
  drm/i915: Handle event stop and destroy for GPU commands submitted
  drm/i915: Insert commands for capturing timestamps in the ring
  drm/i915: Add support for forwarding ring id in sample metadata
    through perf
  drm/i915: Add support for forwarding pid in timestamp sample metadata
    through perf
  drm/i915: Add support for forwarding execbuffer tags in timestamp
    sample metadata
  drm/i915: Support for retrieving MMIO register values along with
    timestamps through perf

 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  43 +++
 drivers/gpu/drm/i915/i915_oa_perf.c | 690 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |   2 +
 include/uapi/drm/i915_drm.h         |  42 +++
 5 files changed, 779 insertions(+)

-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  9:22   ` Chris Wilson
  2015-08-05  9:38   ` Chris Wilson
  2015-08-05  5:55 ` [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf sourab.gupta
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

The current perf PMU driver is specific to collection of OA counter
statistics (which may be done in a periodic or asynchronous way). Since
this ties us (and limits us) to the render ring, we have no means of
collecting data pertaining to the other rings.

To overcome this limitation, we need a new PMU driver which enables
data collection for the other rings as well (in a non-OA specific mode).
This patch adds a new perf PMU to the i915 device private, for handling
profiling requests for non-OA counter data. This data may encompass
timestamps, MMIO register values, etc. for the relevant ring.
The new perf PMU will serve these purposes without constraining itself to
the type of data being dumped (which may restrict the user to a specific
ring, as in the case of OA counters).

The patch introduces this PMU driver along with its associated callbacks.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
 drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++
 3 files changed, 236 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 0553f20..4b91504 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -822,6 +822,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	 * otherwise i915_oa_context_pin_notify() will lock an un-initialized
 	 * spinlock, upsetting lockdep checks */
 	i915_oa_pmu_register(dev);
+	i915_gen_pmu_register(dev);
 
 	intel_pm_setup(dev);
 
@@ -1072,6 +1073,7 @@ int i915_driver_unload(struct drm_device *dev)
 		return ret;
 	}
 
+	i915_gen_pmu_unregister(dev);
 	i915_oa_pmu_unregister(dev);
 	intel_power_domains_fini(dev_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d5d9156..66f9ee9 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1988,6 +1988,21 @@ struct drm_i915_private {
 		int sample_info_flags;
 	} oa_pmu;
 
+	struct {
+		struct pmu pmu;
+		spinlock_t lock;
+		struct hrtimer timer;
+		struct pt_regs dummy_regs;
+		struct perf_event *exclusive_event;
+		bool event_active;
+
+		struct {
+			struct drm_i915_gem_object *obj;
+			u32 gtt_offset;
+			u8 *addr;
+		} buffer;
+	} gen_pmu;
+
 	void (*emit_profiling_data[I915_PROFILE_MAX])
 		(struct drm_i915_gem_request *req, u32 global_ctx_id, u32 tag);
 #endif
@@ -3295,10 +3310,14 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
 /* i915_oa_perf.c */
 #ifdef CONFIG_PERF_EVENTS
 extern void i915_oa_pmu_register(struct drm_device *dev);
+extern void i915_gen_pmu_register(struct drm_device *dev);
 extern void i915_oa_pmu_unregister(struct drm_device *dev);
+extern void i915_gen_pmu_unregister(struct drm_device *dev);
 #else
 static inline void i915_oa_pmu_register(struct drm_device *dev) {}
+static inline void i915_gen_pmu_register(struct drm_device *dev) {}
 static inline void i915_oa_pmu_unregister(struct drm_device *dev) {}
+static inline void i915_gen_pmu_unregister(struct drm_device *dev) {}
 #endif
 
 /* i915_suspend.c */
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 48591fc..37ff0a9 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -414,6 +414,13 @@ static void forward_oa_rcs_work_fn(struct work_struct *__work)
 	forward_oa_rcs_snapshots(dev_priv);
 }
 
+static void forward_gen_pmu_snapshots(struct drm_i915_private *dev_priv)
+{
+	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
+
+	/* TODO: routine for forwarding snapshots to userspace */
+}
+
 static void
 oa_rcs_buffer_destroy(struct drm_i915_private *i915)
 {
@@ -551,6 +558,34 @@ out:
 	spin_unlock(&dev_priv->oa_pmu.lock);
 }
 
+static void gen_buffer_destroy(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+	vunmap(i915->gen_pmu.buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->gen_pmu.buffer.obj);
+	drm_gem_object_unreference(&i915->gen_pmu.buffer.obj->base);
+	mutex_unlock(&i915->dev->struct_mutex);
+
+	spin_lock(&i915->gen_pmu.lock);
+	i915->gen_pmu.buffer.obj = NULL;
+	i915->gen_pmu.buffer.gtt_offset = 0;
+	i915->gen_pmu.buffer.addr = NULL;
+	spin_unlock(&i915->gen_pmu.lock);
+}
+
+static void i915_gen_event_destroy(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	WARN_ON(event->parent);
+
+	gen_buffer_destroy(i915);
+
+	BUG_ON(i915->gen_pmu.exclusive_event != event);
+	i915->gen_pmu.exclusive_event = NULL;
+}
+
 static int alloc_obj(struct drm_i915_private *dev_priv,
 				struct drm_i915_gem_object **obj)
 {
@@ -712,6 +747,41 @@ static int init_oa_rcs_buffer(struct perf_event *event)
 	return 0;
 }
 
+static int init_gen_pmu_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(dev_priv->gen_pmu.buffer.obj);
+
+	ret = alloc_obj(dev_priv, &bo);
+	if (ret)
+		return ret;
+
+	dev_priv->gen_pmu.buffer.obj = bo;
+	dev_priv->gen_pmu.buffer.gtt_offset =
+				i915_gem_obj_ggtt_offset(bo);
+	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
+
+	DRM_DEBUG_DRIVER("Gen PMU Buffer initialized, vaddr = %p",
+			 dev_priv->gen_pmu.buffer.addr);
+
+	return 0;
+}
+
+static enum hrtimer_restart hrtimer_sample_gen(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, typeof(*i915), gen_pmu.timer);
+
+	forward_gen_pmu_snapshots(i915);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
 static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
 {
 	struct drm_i915_private *i915 =
@@ -1224,6 +1294,106 @@ static int i915_oa_event_event_idx(struct perf_event *event)
 	return 0;
 }
 
+static int i915_gen_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	int ret = 0;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* To avoid the complexity of having to accurately filter
+	 * data and marshal to the appropriate client
+	 * we currently only allow exclusive access */
+	if (dev_priv->gen_pmu.buffer.obj)
+		return -EBUSY;
+
+	/*
+	 * We need to check for CAP_SYS_ADMIN capability as we profile all
+	 * the running contexts
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+	ret = init_gen_pmu_buffer(event);
+	if (ret)
+		return ret;
+
+	BUG_ON(dev_priv->gen_pmu.exclusive_event);
+	dev_priv->gen_pmu.exclusive_event = event;
+
+	event->destroy = i915_gen_event_destroy;
+
+	return 0;
+}
+
+static void i915_gen_event_start(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+
+	spin_lock(&dev_priv->gen_pmu.lock);
+	dev_priv->gen_pmu.event_active = true;
+	spin_unlock(&dev_priv->gen_pmu.lock);
+
+	__hrtimer_start_range_ns(&dev_priv->gen_pmu.timer, ns_to_ktime(PERIOD),
+					0, HRTIMER_MODE_REL_PINNED, 0);
+
+	event->hw.state = 0;
+}
+
+static void i915_gen_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+
+	spin_lock(&dev_priv->gen_pmu.lock);
+	dev_priv->gen_pmu.event_active = false;
+	spin_unlock(&dev_priv->gen_pmu.lock);
+
+	hrtimer_cancel(&dev_priv->gen_pmu.timer);
+	forward_gen_pmu_snapshots(dev_priv);
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_gen_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_gen_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_gen_event_del(struct perf_event *event, int flags)
+{
+	i915_gen_event_stop(event, flags);
+}
+
+static void i915_gen_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static int i915_gen_event_flush(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	forward_gen_pmu_snapshots(i915);
+	return 0;
+}
+
+static int i915_gen_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
 void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context)
 {
@@ -1352,3 +1522,48 @@ void i915_oa_pmu_unregister(struct drm_device *dev)
 	perf_pmu_unregister(&i915->oa_pmu.pmu);
 	i915->oa_pmu.pmu.event_init = NULL;
 }
+
+void i915_gen_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (!(IS_HASWELL(dev) || IS_VALLEYVIEW(dev) || IS_BROADWELL(dev)))
+		return;
+
+	i915->gen_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->gen_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->gen_pmu.timer.function = hrtimer_sample_gen;
+
+	spin_lock_init(&i915->gen_pmu.lock);
+
+	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
+
+	/* Effectively disallow opening an event with a specific pid
+	 * since we aren't interested in processes running on the cpu...
+	 */
+	i915->gen_pmu.pmu.task_ctx_nr   = perf_invalid_context;
+
+	i915->gen_pmu.pmu.event_init    = i915_gen_event_init;
+	i915->gen_pmu.pmu.add	       = i915_gen_event_add;
+	i915->gen_pmu.pmu.del	       = i915_gen_event_del;
+	i915->gen_pmu.pmu.start	       = i915_gen_event_start;
+	i915->gen_pmu.pmu.stop	       = i915_gen_event_stop;
+	i915->gen_pmu.pmu.read	       = i915_gen_event_read;
+	i915->gen_pmu.pmu.flush	       = i915_gen_event_flush;
+	i915->gen_pmu.pmu.event_idx     = i915_gen_event_event_idx;
+
+	if (perf_pmu_register(&i915->gen_pmu.pmu, "i915_gen", -1))
+		i915->gen_pmu.pmu.event_init = NULL;
+}
+
+void i915_gen_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->gen_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->gen_pmu.pmu);
+	i915->gen_pmu.pmu.event_init = NULL;
+}
-- 
1.8.5.1



* [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
  2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  9:55   ` Chris Wilson
  2015-08-05  5:55 ` [RFC 3/8] drm/i915: Handle event stop and destroy for GPU commands submitted sourab.gupta
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds the mechanism for forwarding the timestamp data to
userspace using the Gen PMU perf event interface.

The timestamps will be captured in a gem buffer object. The metadata
information (ctx global id, for now) pertaining to each snapshot is maintained
in a list, each node of which holds the offset into the gem buffer object for
its snapshot.
In order to track whether the gpu has completed processing a node,
a field referring to the corresponding gem request is added. The request is
expected to be referenced whenever the gpu command is submitted.

Each snapshot collected is forwarded as a separate perf sample. The perf
sample will have raw timestamp data followed by metadata information
pertaining to that sample.
While forwarding the samples, we check whether the gem request is completed
and drop the reference on the corresponding request. The need to drop the
request reference necessitates a worker here, which is scheduled when the
hrtimer fires.
While flushing the samples, we have to wait for the requests already
scheduled, before forwarding the samples. This wait is done in a lockless
fashion.

v2: Changes here pertaining to (as suggested by Chris):
    - Forwarding functionality implemented in a separate fn. The work item
      (scheduled from hrtimer/event stop) would be calling that function.
      The event flush would directly call this forwarding fn. This meets
      the flush semantics.
    - use spin_lock instead of spin_lock_irqsave
    - Code restructuring & better nomenclature

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  11 +++
 drivers/gpu/drm/i915/i915_oa_perf.c | 131 ++++++++++++++++++++++++++++++++++--
 include/uapi/drm/i915_drm.h         |  10 +++
 3 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 66f9ee9..08235582 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1676,6 +1676,13 @@ struct i915_oa_rcs_node {
 	u32 tag;
 };
 
+struct i915_gen_pmu_node {
+	struct list_head head;
+	struct drm_i915_gem_request *req;
+	u32 offset;
+	u32 ctx_id;
+};
+
 extern const struct i915_oa_reg i915_oa_3d_mux_config_hsw[];
 extern const int i915_oa_3d_mux_config_hsw_len;
 extern const struct i915_oa_reg i915_oa_3d_b_counter_config_hsw[];
@@ -2000,7 +2007,11 @@ struct drm_i915_private {
 			struct drm_i915_gem_object *obj;
 			u32 gtt_offset;
 			u8 *addr;
+			u32 node_size;
+			u32 node_count;
 		} buffer;
+		struct list_head node_list;
+		struct work_struct forward_work;
 	} gen_pmu;
 
 	void (*emit_profiling_data[I915_PROFILE_MAX])
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 37ff0a9..3add862 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -11,6 +11,9 @@
 #define FREQUENCY 200
 #define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
 
+#define TS_DATA_SIZE sizeof(struct drm_i915_ts_data)
+#define CTX_INFO_SIZE sizeof(struct drm_i915_ts_node_ctx_id)
+
 static u32 i915_oa_event_paranoid = true;
 
 static int hsw_perf_format_sizes[] = {
@@ -414,11 +417,113 @@ static void forward_oa_rcs_work_fn(struct work_struct *__work)
 	forward_oa_rcs_snapshots(dev_priv);
 }
 
+static int i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
+{
+	struct i915_gen_pmu_node *last_entry = NULL;
+	int ret;
+
+	/*
+	 * Wait for the last scheduled request to complete. This would
+	 * implicitly wait for the prior submitted requests. The refcount
+	 * of the requests is not decremented here.
+	 */
+	spin_lock(&dev_priv->gen_pmu.lock);
+
+	if (!list_empty(&dev_priv->gen_pmu.node_list)) {
+		last_entry = list_last_entry(&dev_priv->gen_pmu.node_list,
+			struct i915_gen_pmu_node, head);
+	}
+	spin_unlock(&dev_priv->gen_pmu.lock);
+
+	if (!last_entry)
+		return 0;
+
+	ret = __i915_wait_request(last_entry->req, atomic_read(
+			&dev_priv->gpu_error.reset_counter),
+			true, NULL, NULL);
+	if (ret) {
+		DRM_ERROR("failed to wait\n");
+		return ret;
+	}
+	return 0;
+}
+
+static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
+				struct i915_gen_pmu_node *node)
+{
+	struct perf_sample_data data;
+	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	int snapshot_size;
+	u8 *snapshot;
+	struct drm_i915_ts_node_ctx_id *ctx_info;
+	struct perf_raw_record raw;
+
+	BUILD_BUG_ON(TS_DATA_SIZE != 8);
+	BUILD_BUG_ON(CTX_INFO_SIZE != 8);
+
+	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
+	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
+
+	ctx_info = (struct drm_i915_ts_node_ctx_id *)(snapshot + TS_DATA_SIZE);
+	ctx_info->ctx_id = node->ctx_id;
+
+	/* Note: the raw sample consists of a u32 size member and raw data. The
+	 * combined size of these two fields is required to be 8 byte aligned.
+	 * The size of raw data field is assumed to be 8 byte aligned already.
+	 * Therefore, adding 4 bytes to the raw sample size here.
+	 */
+	BUILD_BUG_ON(((snapshot_size + 4 + sizeof(raw.size)) % 8) != 0);
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+	raw.size = snapshot_size + 4;
+	raw.data = snapshot;
+
+	data.raw = &raw;
+	perf_event_overflow(event, &data, &dev_priv->gen_pmu.dummy_regs);
+}
+
+/*
+ * Routine to forward the samples to perf. This may be called from the event
+ * flush and worker thread. This function may sleep, hence can't be called from
+ * atomic contexts directly.
+ */
 static void forward_gen_pmu_snapshots(struct drm_i915_private *dev_priv)
 {
-	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
+	struct i915_gen_pmu_node *entry, *next;
+	LIST_HEAD(deferred_list_free);
+	int ret;
 
-	/* TODO: routine for forwarding snapshots to userspace */
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->gen_pmu.node_list, head) {
+		if (!i915_gem_request_completed(entry->req, true))
+			break;
+
+		forward_one_gen_pmu_sample(dev_priv, entry);
+
+		spin_lock(&dev_priv->gen_pmu.lock);
+		list_move_tail(&entry->head, &deferred_list_free);
+		spin_unlock(&dev_priv->gen_pmu.lock);
+	}
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+	while (!list_empty(&deferred_list_free)) {
+		entry = list_first_entry(&deferred_list_free,
+					struct i915_gen_pmu_node, head);
+		i915_gem_request_unreference(entry->req);
+		list_del(&entry->head);
+		kfree(entry);
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+}
+
+static void forward_gen_pmu_work_fn(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv), gen_pmu.forward_work);
+
+	forward_gen_pmu_snapshots(dev_priv);
 }
 
 static void
@@ -752,7 +857,7 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
 	struct drm_i915_gem_object *bo;
-	int ret;
+	int ret, node_size;
 
 	BUG_ON(dev_priv->gen_pmu.buffer.obj);
 
@@ -764,6 +869,14 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	dev_priv->gen_pmu.buffer.gtt_offset =
 				i915_gem_obj_ggtt_offset(bo);
 	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
+	INIT_LIST_HEAD(&dev_priv->gen_pmu.node_list);
+
+	node_size = TS_DATA_SIZE + CTX_INFO_SIZE;
+
+	/* size has to be aligned to 8 bytes */
+	node_size = ALIGN(node_size, 8);
+	dev_priv->gen_pmu.buffer.node_size = node_size;
+	dev_priv->gen_pmu.buffer.node_count = bo->base.size / node_size;
 
 	DRM_DEBUG_DRIVER("Gen PMU Buffer initialized, vaddr = %p",
 			 dev_priv->gen_pmu.buffer.addr);
@@ -776,7 +889,7 @@ static enum hrtimer_restart hrtimer_sample_gen(struct hrtimer *hrtimer)
 	struct drm_i915_private *i915 =
 		container_of(hrtimer, typeof(*i915), gen_pmu.timer);
 
-	forward_gen_pmu_snapshots(i915);
+	schedule_work(&i915->gen_pmu.forward_work);
 
 	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
 	return HRTIMER_RESTART;
@@ -1353,7 +1466,7 @@ static void i915_gen_event_stop(struct perf_event *event, int flags)
 	spin_unlock(&dev_priv->gen_pmu.lock);
 
 	hrtimer_cancel(&dev_priv->gen_pmu.timer);
-	forward_gen_pmu_snapshots(dev_priv);
+	schedule_work(&dev_priv->gen_pmu.forward_work);
 
 	event->hw.state = PERF_HES_STOPPED;
 }
@@ -1384,6 +1497,11 @@ static int i915_gen_event_flush(struct perf_event *event)
 {
 	struct drm_i915_private *i915 =
 		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+	int ret;
+
+	ret = i915_gen_pmu_wait_gpu(i915);
+	if (ret)
+		return ret;
 
 	forward_gen_pmu_snapshots(i915);
 	return 0;
@@ -1535,6 +1653,7 @@ void i915_gen_pmu_register(struct drm_device *dev)
 	hrtimer_init(&i915->gen_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
+	INIT_WORK(&i915->gen_pmu.forward_work, forward_gen_pmu_work_fn);
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -1564,6 +1683,8 @@ void i915_gen_pmu_unregister(struct drm_device *dev)
 	if (i915->gen_pmu.pmu.event_init == NULL)
 		return;
 
+	cancel_work_sync(&i915->gen_pmu.forward_work);
+
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;
 }
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index abe5826..4a19c9b 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -140,6 +140,16 @@ struct drm_i915_oa_node_tag {
 	__u32 pad;
 };
 
+struct drm_i915_ts_data {
+	__u32 ts_low;
+	__u32 ts_high;
+};
+
+struct drm_i915_ts_node_ctx_id {
+	__u32 ctx_id;
+	__u32 pad;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1



* [RFC 3/8] drm/i915: Handle event stop and destroy for GPU commands submitted
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
  2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
  2015-08-05  5:55 ` [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  5:55 ` [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring sourab.gupta
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch handles the event stop and destroy callbacks, taking into account
the fact that there may be commands scheduled on the GPU which utilize the
destination buffer.

The event stop just sets the event state and stops forwarding data to
userspace. From the userspace perspective, for all purposes, the event
sampling is stopped. A subsequent event start (without an event destroy) will
start forwarding samples again.

The event destroy releases the local copy of the destination buffer. But
since an active reference on the buffer is expected to be taken while
inserting commands, we can rest assured that the buffer is freed up only
after the GPU is done with it.
Still, there is a need to schedule a worker from event destroy, because we
need to do further work such as dropping the request references.

The ideal solution here would be a callback when the last request is
finished on the GPU, so that we could do this work there (WIP: Chris'
retire-notification mechanism). Until then, a worker thread will do.

A subsequent event init would have to either wait for the previously
submitted RPC commands to complete or return -EBUSY. Currently, for the sake
of simplicity, we return -EBUSY if such a case is detected.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 87 ++++++++++++++++++++++++++++++++-----
 2 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 08235582..5717cb0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1680,6 +1680,7 @@ struct i915_gen_pmu_node {
 	struct list_head head;
 	struct drm_i915_gem_request *req;
 	u32 offset;
+	bool discard;
 	u32 ctx_id;
 };
 
@@ -2012,6 +2013,7 @@ struct drm_i915_private {
 		} buffer;
 		struct list_head node_list;
 		struct work_struct forward_work;
+		struct work_struct event_destroy_work;
 	} gen_pmu;
 
 	void (*emit_profiling_data[I915_PROFILE_MAX])
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 3add862..06645f0 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -448,6 +448,21 @@ static int i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
 	return 0;
 }
 
+static void i915_gen_pmu_release_request_ref(struct drm_i915_private *dev_priv)
+{
+	struct i915_gen_pmu_node *entry, *next;
+
+	list_for_each_entry_safe
+		(entry, next, &dev_priv->gen_pmu.node_list, head) {
+		i915_gem_request_unreference__unlocked(entry->req);
+
+		spin_lock(&dev_priv->gen_pmu.lock);
+		list_del(&entry->head);
+		spin_unlock(&dev_priv->gen_pmu.lock);
+		kfree(entry);
+	}
+}
+
 static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 				struct i915_gen_pmu_node *node)
 {
@@ -498,7 +513,8 @@ static void forward_gen_pmu_snapshots(struct drm_i915_private *dev_priv)
 		if (!i915_gem_request_completed(entry->req, true))
 			break;
 
-		forward_one_gen_pmu_sample(dev_priv, entry);
+		if (!entry->discard)
+			forward_one_gen_pmu_sample(dev_priv, entry);
 
 		spin_lock(&dev_priv->gen_pmu.lock);
 		list_move_tail(&entry->head, &deferred_list_free);
@@ -523,6 +539,13 @@ static void forward_gen_pmu_work_fn(struct work_struct *__work)
 	struct drm_i915_private *dev_priv =
 		container_of(__work, typeof(*dev_priv), gen_pmu.forward_work);
 
+	spin_lock(&dev_priv->gen_pmu.lock);
+	if (dev_priv->gen_pmu.event_active != true) {
+		spin_unlock(&dev_priv->gen_pmu.lock);
+		return;
+	}
+	spin_unlock(&dev_priv->gen_pmu.lock);
+
 	forward_gen_pmu_snapshots(dev_priv);
 }
 
@@ -670,12 +693,6 @@ static void gen_buffer_destroy(struct drm_i915_private *i915)
 	i915_gem_object_ggtt_unpin(i915->gen_pmu.buffer.obj);
 	drm_gem_object_unreference(&i915->gen_pmu.buffer.obj->base);
 	mutex_unlock(&i915->dev->struct_mutex);
-
-	spin_lock(&i915->gen_pmu.lock);
-	i915->gen_pmu.buffer.obj = NULL;
-	i915->gen_pmu.buffer.gtt_offset = 0;
-	i915->gen_pmu.buffer.addr = NULL;
-	spin_unlock(&i915->gen_pmu.lock);
 }
 
 static void i915_gen_event_destroy(struct perf_event *event)
@@ -685,10 +702,44 @@ static void i915_gen_event_destroy(struct perf_event *event)
 
 	WARN_ON(event->parent);
 
-	gen_buffer_destroy(i915);
+	cancel_work_sync(&i915->gen_pmu.forward_work);
 
 	BUG_ON(i915->gen_pmu.exclusive_event != event);
 	i915->gen_pmu.exclusive_event = NULL;
+
+	/* We can dereference our local copy of the destination buffer
+	 * here, since an active reference to the buffer is taken while
+	 * inserting commands. So the buffer is freed up only after the
+	 * GPU is done with it.
+	 */
+	gen_buffer_destroy(i915);
+
+	schedule_work(&i915->gen_pmu.event_destroy_work);
+}
+
+static void i915_gen_pmu_event_destroy_work(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			gen_pmu.event_destroy_work);
+	int ret;
+
+	ret = i915_gen_pmu_wait_gpu(dev_priv);
+	if (ret)
+		goto out;
+
+	i915_gen_pmu_release_request_ref(dev_priv);
+
+out:
+	/*
+	 * Clearing the buffer pointers here excludes creation of a new
+	 * event until we've finished processing the old one
+	 */
+	spin_lock(&dev_priv->gen_pmu.lock);
+	dev_priv->gen_pmu.buffer.obj = NULL;
+	dev_priv->gen_pmu.buffer.gtt_offset = 0;
+	dev_priv->gen_pmu.buffer.addr = NULL;
+	spin_unlock(&dev_priv->gen_pmu.lock);
 }
 
 static int alloc_obj(struct drm_i915_private *dev_priv,
@@ -1411,6 +1462,7 @@ static int i915_gen_event_init(struct perf_event *event)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	unsigned long lock_flags;
 	int ret = 0;
 
 	if (event->attr.type != event->pmu->type)
@@ -1419,8 +1471,12 @@ static int i915_gen_event_init(struct perf_event *event)
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
-	if (dev_priv->gen_pmu.buffer.obj)
+	spin_lock_irqsave(&dev_priv->gen_pmu.lock, lock_flags);
+	if (dev_priv->gen_pmu.buffer.obj) {
+		spin_unlock_irqrestore(&dev_priv->gen_pmu.lock, lock_flags);
 		return -EBUSY;
+	}
+	spin_unlock_irqrestore(&dev_priv->gen_pmu.lock, lock_flags);
 
 	/*
 	 * We need to check for CAP_SYS_ADMIN capability as we profile all
@@ -1460,14 +1516,17 @@ static void i915_gen_event_stop(struct perf_event *event, int flags)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	struct i915_gen_pmu_node *entry;
+
+	hrtimer_cancel(&dev_priv->gen_pmu.timer);
+	schedule_work(&dev_priv->gen_pmu.forward_work);
 
 	spin_lock(&dev_priv->gen_pmu.lock);
 	dev_priv->gen_pmu.event_active = false;
+	list_for_each_entry(entry, &dev_priv->gen_pmu.node_list, head)
+		entry->discard = true;
 	spin_unlock(&dev_priv->gen_pmu.lock);
 
-	hrtimer_cancel(&dev_priv->gen_pmu.timer);
-	schedule_work(&dev_priv->gen_pmu.forward_work);
-
 	event->hw.state = PERF_HES_STOPPED;
 }
 
@@ -1654,6 +1713,9 @@ void i915_gen_pmu_register(struct drm_device *dev)
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
 	INIT_WORK(&i915->gen_pmu.forward_work, forward_gen_pmu_work_fn);
+	INIT_WORK(&i915->gen_pmu.event_destroy_work,
+			i915_gen_pmu_event_destroy_work);
+
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -1684,6 +1746,7 @@ void i915_gen_pmu_unregister(struct drm_device *dev)
 		return;
 
 	cancel_work_sync(&i915->gen_pmu.forward_work);
+	cancel_work_sync(&i915->gen_pmu.event_destroy_work);
 
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (2 preceding siblings ...)
  2015-08-05  5:55 ` [RFC 3/8] drm/i915: Handle event stop and destroy for GPU commands submitted sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  9:30   ` Chris Wilson
  2015-08-05  5:55 ` [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf sourab.gupta
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds the routines for inserting commands into the ringbuf to
capture timestamps. These commands are inserted around the batchbuffer.

While inserting the commands, we keep a reference to the associated
request. This is released when we forward the samples to userspace (or
when the event is destroyed).
Also, an active reference to the destination buffer is taken here, so
that we can be assured the buffer is freed up only after the GPU is done
with it, even if the local reference to the buffer is released.

v2: Changes (as suggested by Chris):
    - Passing in 'request' struct for emit report function
    - Removed multiple calls to i915_gem_obj_to_ggtt(). Keeping hold of
      pinned vma from start and using when required.
    - Better nomenclature, and error handling.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 77 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |  2 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5717cb0..46ece85 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1663,6 +1663,7 @@ enum i915_oa_event_state {
 
 enum i915_profile_mode {
 	I915_PROFILE_OA = 0,
+	I915_PROFILE_TS,
 	I915_PROFILE_MAX,
 };
 
@@ -2007,6 +2008,7 @@ struct drm_i915_private {
 		struct {
 			struct drm_i915_gem_object *obj;
 			u32 gtt_offset;
+			struct i915_vma *vma;
 			u8 *addr;
 			u32 node_size;
 			u32 node_count;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 06645f0..2cf7f1b 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -113,6 +113,79 @@ static void i915_oa_emit_perf_report(struct drm_i915_gem_request *req,
 	i915_vma_move_to_active(dev_priv->oa_pmu.oa_rcs_buffer.vma, ring);
 }
 
+/*
+ * Emits the commands to capture timestamps into the CS
+ */
+static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
+			u32 global_ctx_id, u32 tag)
+{
+	struct intel_engine_cs *ring = req->ring;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_gem_object *obj = dev_priv->gen_pmu.buffer.obj;
+	struct i915_gen_pmu_node *entry;
+	u32 addr = 0;
+	int ret;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (entry == NULL) {
+		DRM_ERROR("alloc failed\n");
+		return;
+	}
+
+	ret = intel_ring_begin(ring, 6);
+	if (ret) {
+		kfree(entry);
+		return;
+	}
+
+	entry->ctx_id = global_ctx_id;
+	i915_gem_request_assign(&entry->req, ring->outstanding_lazy_request);
+
+	spin_lock(&dev_priv->gen_pmu.lock);
+	if (list_empty(&dev_priv->gen_pmu.node_list)) {
+		entry->offset = 0;
+	} else {
+		struct i915_gen_pmu_node *last_entry;
+		int max_offset = dev_priv->gen_pmu.buffer.node_count *
+				dev_priv->gen_pmu.buffer.node_size;
+
+		last_entry = list_last_entry(&dev_priv->gen_pmu.node_list,
+					struct i915_gen_pmu_node, head);
+		entry->offset = last_entry->offset +
+				dev_priv->gen_pmu.buffer.node_size;
+
+		if (entry->offset >= max_offset)
+			entry->offset = 0;
+	}
+	list_add_tail(&entry->head, &dev_priv->gen_pmu.node_list);
+	spin_unlock(&dev_priv->gen_pmu.lock);
+
+	addr = dev_priv->gen_pmu.buffer.gtt_offset + entry->offset;
+
+	if (ring->id == RCS) {
+		intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(5));
+		intel_ring_emit(ring,
+				PIPE_CONTROL_GLOBAL_GTT_IVB |
+				PIPE_CONTROL_TIMESTAMP_WRITE);
+		intel_ring_emit(ring, addr | PIPE_CONTROL_GLOBAL_GTT);
+		intel_ring_emit(ring, 0);
+		intel_ring_emit(ring, 0);
+		intel_ring_emit(ring, MI_NOOP);
+	} else {
+		intel_ring_emit(ring,
+				(MI_FLUSH_DW + 1) | MI_FLUSH_DW_OP_STAMP);
+		intel_ring_emit(ring, addr | MI_FLUSH_DW_USE_GTT);
+		intel_ring_emit(ring, 0);
+		intel_ring_emit(ring, 0);
+		intel_ring_emit(ring, MI_NOOP);
+		intel_ring_emit(ring, MI_NOOP);
+	}
+	intel_ring_advance(ring);
+
+	obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
+	i915_vma_move_to_active(dev_priv->gen_pmu.buffer.vma, ring);
+}
+
 static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
 					     u8 *snapshot,
 					     struct perf_event *event)
@@ -738,6 +811,7 @@ out:
 	spin_lock(&dev_priv->gen_pmu.lock);
 	dev_priv->gen_pmu.buffer.obj = NULL;
 	dev_priv->gen_pmu.buffer.gtt_offset = 0;
+	dev_priv->gen_pmu.buffer.vma = NULL;
 	dev_priv->gen_pmu.buffer.addr = NULL;
 	spin_unlock(&dev_priv->gen_pmu.lock);
 }
@@ -919,6 +993,7 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	dev_priv->gen_pmu.buffer.obj = bo;
 	dev_priv->gen_pmu.buffer.gtt_offset =
 				i915_gem_obj_ggtt_offset(bo);
+	dev_priv->gen_pmu.buffer.vma = i915_gem_obj_to_ggtt(bo);
 	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
 	INIT_LIST_HEAD(&dev_priv->gen_pmu.node_list);
 
@@ -1504,6 +1579,7 @@ static void i915_gen_event_start(struct perf_event *event, int flags)
 
 	spin_lock(&dev_priv->gen_pmu.lock);
 	dev_priv->gen_pmu.event_active = true;
+	dev_priv->emit_profiling_data[I915_PROFILE_TS] = i915_gen_emit_ts_data;
 	spin_unlock(&dev_priv->gen_pmu.lock);
 
 	__hrtimer_start_range_ns(&dev_priv->gen_pmu.timer, ns_to_ktime(PERIOD),
@@ -1523,6 +1599,7 @@ static void i915_gen_event_stop(struct perf_event *event, int flags)
 
 	spin_lock(&dev_priv->gen_pmu.lock);
 	dev_priv->gen_pmu.event_active = false;
+	dev_priv->emit_profiling_data[I915_PROFILE_TS] = NULL;
 	list_for_each_entry(entry, &dev_priv->gen_pmu.node_list, head)
 		entry->discard = true;
 	spin_unlock(&dev_priv->gen_pmu.lock);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index c9955968..f816b08 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -347,6 +347,7 @@
 #define   MI_FLUSH_DW_STORE_INDEX	(1<<21)
 #define   MI_INVALIDATE_TLB		(1<<18)
 #define   MI_FLUSH_DW_OP_STOREDW	(1<<14)
+#define   MI_FLUSH_DW_OP_STAMP		(3<<14)
 #define   MI_FLUSH_DW_OP_MASK		(3<<14)
 #define   MI_FLUSH_DW_NOTIFY		(1<<8)
 #define   MI_INVALIDATE_BSD		(1<<7)
@@ -423,6 +424,7 @@
 #define   PIPE_CONTROL_MEDIA_STATE_CLEAR		(1<<16)
 #define   PIPE_CONTROL_QW_WRITE				(1<<14)
 #define   PIPE_CONTROL_POST_SYNC_OP_MASK                (3<<14)
+#define   PIPE_CONTROL_TIMESTAMP_WRITE			(3<<14)
 #define   PIPE_CONTROL_DEPTH_STALL			(1<<13)
 #define   PIPE_CONTROL_WRITE_FLUSH			(1<<12)
 #define   PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH	(1<<12) /* gen6+ */
-- 
1.8.5.1


* [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (3 preceding siblings ...)
  2015-08-05  5:55 ` [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  9:26   ` Chris Wilson
  2015-08-05  5:55 ` [RFC 6/8] drm/i915: Add support for forwarding pid in timestamp " sourab.gupta
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces flags and adds support for having ring id output with
the timestamp samples and forwarding them through perf.

When userspace expresses its interest in the ring id through a gen pmu
attr field during event init, the samples generated will have an
additional field appended carrying the ring id information. This patch
enables this framework, which can be expanded upon to introduce further
fields in the gen pmu attr, through which additional metadata can be
appended to the samples.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 98 +++++++++++++++++++++++++++++++++++--
 include/uapi/drm/i915_drm.h         | 13 +++++
 3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 46ece85..70f1bd6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1683,6 +1683,7 @@ struct i915_gen_pmu_node {
 	u32 offset;
 	bool discard;
 	u32 ctx_id;
+	u32 ring;
 };
 
 extern const struct i915_oa_reg i915_oa_3d_mux_config_hsw[];
@@ -2016,6 +2017,8 @@ struct drm_i915_private {
 		struct list_head node_list;
 		struct work_struct forward_work;
 		struct work_struct event_destroy_work;
+#define I915_GEN_PMU_SAMPLE_RING		(1<<0)
+		int sample_info_flags;
 	} gen_pmu;
 
 	void (*emit_profiling_data[I915_PROFILE_MAX])
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 2cf7f1b..41e2407 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -13,6 +13,7 @@
 
 #define TS_DATA_SIZE sizeof(struct drm_i915_ts_data)
 #define CTX_INFO_SIZE sizeof(struct drm_i915_ts_node_ctx_id)
+#define RING_INFO_SIZE sizeof(struct drm_i915_ts_node_ring_id)
 
 static u32 i915_oa_event_paranoid = true;
 
@@ -113,6 +114,9 @@ static void i915_oa_emit_perf_report(struct drm_i915_gem_request *req,
 	i915_vma_move_to_active(dev_priv->oa_pmu.oa_rcs_buffer.vma, ring);
 }
 
+/* Returns the ring's ID mask (i.e. I915_EXEC_<ring>) */
+#define ring_id_mask(ring) ((ring)->id + 1)
+
 /*
  * Emits the commands to capture timestamps, into the CS
  */
@@ -139,6 +143,8 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 	}
 
 	entry->ctx_id = global_ctx_id;
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING)
+		entry->ring = ring_id_mask(ring);
 	i915_gem_request_assign(&entry->req, ring->outstanding_lazy_request);
 
 	spin_lock(&dev_priv->gen_pmu.lock);
@@ -542,18 +548,27 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 	struct perf_sample_data data;
 	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
 	int snapshot_size;
-	u8 *snapshot;
+	u8 *snapshot, *current_ptr;
 	struct drm_i915_ts_node_ctx_id *ctx_info;
+	struct drm_i915_ts_node_ring_id *ring_info;
 	struct perf_raw_record raw;
 
-	BUILD_BUG_ON(TS_DATA_SIZE != 8);
-	BUILD_BUG_ON(CTX_INFO_SIZE != 8);
+	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
+			(RING_INFO_SIZE != 8));
 
 	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
 	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
 
 	ctx_info = (struct drm_i915_ts_node_ctx_id *)(snapshot + TS_DATA_SIZE);
 	ctx_info->ctx_id = node->ctx_id;
+	current_ptr = snapshot + snapshot_size;
+
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING) {
+		ring_info = (struct drm_i915_ts_node_ring_id *)current_ptr;
+		ring_info->ring = node->ring;
+		snapshot_size += RING_INFO_SIZE;
+		current_ptr = snapshot + snapshot_size;
+	}
 
 	/* Note: the raw sample consists of a u32 size member and raw data. The
 	 * combined size of these two fields is required to be 8 byte aligned.
@@ -999,6 +1014,9 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 
 	node_size = TS_DATA_SIZE + CTX_INFO_SIZE;
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING)
+		node_size += RING_INFO_SIZE;
+
 	/* size has to be aligned to 8 bytes */
 	node_size = ALIGN(node_size, 8);
 	dev_priv->gen_pmu.buffer.node_size = node_size;
@@ -1533,16 +1551,90 @@ static int i915_oa_event_event_idx(struct perf_event *event)
 	return 0;
 }
 
+static int i915_gen_pmu_copy_attr(struct drm_i915_gen_pmu_attr __user *uattr,
+			     struct drm_i915_gen_pmu_attr *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, I915_GEN_PMU_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (size < I915_GEN_PMU_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we dont know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
 static int i915_gen_event_init(struct perf_event *event)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	struct drm_i915_gen_pmu_attr gen_attr;
 	unsigned long lock_flags;
 	int ret = 0;
 
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
 
+	BUILD_BUG_ON(sizeof(struct drm_i915_gen_pmu_attr) !=
+			I915_GEN_PMU_ATTR_SIZE_VER0);
+
+	ret = i915_gen_pmu_copy_attr(to_user_ptr(event->attr.config),
+				&gen_attr);
+	if (ret)
+		return ret;
+
+	if (gen_attr.sample_ring)
+		dev_priv->gen_pmu.sample_info_flags |=
+				I915_GEN_PMU_SAMPLE_RING;
+
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4a19c9b..5b484fb 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -81,6 +81,8 @@
 
 #define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
 
+#define I915_GEN_PMU_ATTR_SIZE_VER0	8  /* sizeof first published struct */
+
 typedef struct _drm_i915_oa_attr {
 	__u32 size;
 
@@ -98,6 +100,12 @@ typedef struct _drm_i915_oa_attr {
 		__reserved_1:60;
 } drm_i915_oa_attr_t;
 
+struct drm_i915_gen_pmu_attr {
+	__u32 size;
+	__u32 sample_ring:1,
+		__reserved_1:31;
+};
+
 /* Header for PERF_RECORD_DEVICE type events */
 typedef struct _drm_i915_oa_event_header {
 	__u32 type;
@@ -150,6 +158,11 @@ struct drm_i915_ts_node_ctx_id {
 	__u32 pad;
 };
 
+struct drm_i915_ts_node_ring_id {
+	__u32 ring;
+	__u32 pad;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1


* [RFC 6/8] drm/i915: Add support for forwarding pid in timestamp sample metadata through perf
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (4 preceding siblings ...)
  2015-08-05  5:55 ` [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  5:55 ` [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata sourab.gupta
  2015-08-05  5:55 ` [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf sourab.gupta
  7 siblings, 0 replies; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces flags and adds support for having pid output with the
timestamp samples and forwarding them through perf.

When userspace expresses its interest in the pid through a gen pmu attr
field during event init, the samples generated will have an additional
field appended carrying the pid information.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 19 ++++++++++++++++++-
 include/uapi/drm/i915_drm.h         |  8 +++++++-
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 70f1bd6..f46687a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1684,6 +1684,7 @@ struct i915_gen_pmu_node {
 	bool discard;
 	u32 ctx_id;
 	u32 ring;
+	u32 pid;
 };
 
 extern const struct i915_oa_reg i915_oa_3d_mux_config_hsw[];
@@ -2018,6 +2019,7 @@ struct drm_i915_private {
 		struct work_struct forward_work;
 		struct work_struct event_destroy_work;
 #define I915_GEN_PMU_SAMPLE_RING		(1<<0)
+#define I915_GEN_PMU_SAMPLE_PID			(1<<1)
 		int sample_info_flags;
 	} gen_pmu;
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 41e2407..f73d23c 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -14,6 +14,7 @@
 #define TS_DATA_SIZE sizeof(struct drm_i915_ts_data)
 #define CTX_INFO_SIZE sizeof(struct drm_i915_ts_node_ctx_id)
 #define RING_INFO_SIZE sizeof(struct drm_i915_ts_node_ring_id)
+#define PID_INFO_SIZE sizeof(struct drm_i915_ts_node_pid)
 
 static u32 i915_oa_event_paranoid = true;
 
@@ -145,6 +146,8 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 	entry->ctx_id = global_ctx_id;
 	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING)
 		entry->ring = ring_id_mask(ring);
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_PID)
+		entry->pid = current->pid;
 	i915_gem_request_assign(&entry->req, ring->outstanding_lazy_request);
 
 	spin_lock(&dev_priv->gen_pmu.lock);
@@ -551,10 +554,11 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 	u8 *snapshot, *current_ptr;
 	struct drm_i915_ts_node_ctx_id *ctx_info;
 	struct drm_i915_ts_node_ring_id *ring_info;
+	struct drm_i915_ts_node_pid *pid_info;
 	struct perf_raw_record raw;
 
 	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
-			(RING_INFO_SIZE != 8));
+			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
 
 	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
 	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
@@ -570,6 +574,13 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 		current_ptr = snapshot + snapshot_size;
 	}
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_PID) {
+		pid_info = (struct drm_i915_ts_node_pid *)current_ptr;
+		pid_info->pid = node->pid;
+		snapshot_size += PID_INFO_SIZE;
+		current_ptr = snapshot + snapshot_size;
+	}
+
 	/* Note: the raw sample consists of a u32 size member and raw data. The
 	 * combined size of these two fields is required to be 8 byte aligned.
 	 * The size of raw data field is assumed to be 8 byte aligned already.
@@ -1017,6 +1028,9 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING)
 		node_size += RING_INFO_SIZE;
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_PID)
+		node_size += PID_INFO_SIZE;
+
 	/* size has to be aligned to 8 bytes */
 	node_size = ALIGN(node_size, 8);
 	dev_priv->gen_pmu.buffer.node_size = node_size;
@@ -1635,6 +1649,9 @@ static int i915_gen_event_init(struct perf_event *event)
 		dev_priv->gen_pmu.sample_info_flags |=
 				I915_GEN_PMU_SAMPLE_RING;
 
+	if (gen_attr.sample_pid)
+		dev_priv->gen_pmu.sample_info_flags |= I915_GEN_PMU_SAMPLE_PID;
+
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 5b484fb..3dcc862 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -103,7 +103,8 @@ typedef struct _drm_i915_oa_attr {
 struct drm_i915_gen_pmu_attr {
 	__u32 size;
 	__u32 sample_ring:1,
-		__reserved_1:31;
+		sample_pid:1,
+		__reserved_1:30;
 };
 
 /* Header for PERF_RECORD_DEVICE type events */
@@ -163,6 +164,11 @@ struct drm_i915_ts_node_ring_id {
 	__u32 pad;
 };
 
+struct drm_i915_ts_node_pid {
+	__u32 pid;
+	__u32 pad;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1


* [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (5 preceding siblings ...)
  2015-08-05  5:55 ` [RFC 6/8] drm/i915: Add support for forwarding pid in timestamp " sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05  9:17   ` Chris Wilson
  2015-08-05  5:55 ` [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf sourab.gupta
  7 siblings, 1 reply; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch enables userspace to specify tags (per workload), provided via
the execbuffer ioctl, which can be added to the timestamp samples to help
associate samples with the corresponding workloads.

There may be multiple stages within a single context, from a userspace
perspective. The ability to individually associate samples with their
corresponding workloads (execbuffers) is needed, which may not be
possible with the ctx_id or pid information alone.
This patch enables that mechanism.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 20 +++++++++++++++++++-
 include/uapi/drm/i915_drm.h         |  8 +++++++-
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f46687a..c3e823f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1685,6 +1685,7 @@ struct i915_gen_pmu_node {
 	u32 ctx_id;
 	u32 ring;
 	u32 pid;
+	u32 tag;
 };
 
 extern const struct i915_oa_reg i915_oa_3d_mux_config_hsw[];
@@ -2020,6 +2021,7 @@ struct drm_i915_private {
 		struct work_struct event_destroy_work;
 #define I915_GEN_PMU_SAMPLE_RING		(1<<0)
 #define I915_GEN_PMU_SAMPLE_PID			(1<<1)
+#define I915_GEN_PMU_SAMPLE_TAG			(1<<2)
 		int sample_info_flags;
 	} gen_pmu;
 
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index f73d23c..e065e06 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -15,6 +15,7 @@
 #define CTX_INFO_SIZE sizeof(struct drm_i915_ts_node_ctx_id)
 #define RING_INFO_SIZE sizeof(struct drm_i915_ts_node_ring_id)
 #define PID_INFO_SIZE sizeof(struct drm_i915_ts_node_pid)
+#define TAG_INFO_SIZE sizeof(struct drm_i915_ts_node_tag)
 
 static u32 i915_oa_event_paranoid = true;
 
@@ -148,6 +149,8 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 		entry->ring = ring_id_mask(ring);
 	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_PID)
 		entry->pid = current->pid;
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_TAG)
+		entry->tag = tag;
 	i915_gem_request_assign(&entry->req, ring->outstanding_lazy_request);
 
 	spin_lock(&dev_priv->gen_pmu.lock);
@@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 	struct drm_i915_ts_node_ctx_id *ctx_info;
 	struct drm_i915_ts_node_ring_id *ring_info;
 	struct drm_i915_ts_node_pid *pid_info;
+	struct drm_i915_ts_node_tag *tag_info;
 	struct perf_raw_record raw;
 
 	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
-			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
+			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
+			(TAG_INFO_SIZE != 8));
 
 	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
 	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
@@ -581,6 +586,13 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 		current_ptr = snapshot + snapshot_size;
 	}
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_TAG) {
+		tag_info = (struct drm_i915_ts_node_tag *)current_ptr;
+		tag_info->tag = node->tag;
+		snapshot_size += TAG_INFO_SIZE;
+		current_ptr = snapshot + snapshot_size;
+	}
+
 	/* Note: the raw sample consists of a u32 size member and raw data. The
 	 * combined size of these two fields is required to be 8 byte aligned.
 	 * The size of raw data field is assumed to be 8 byte aligned already.
@@ -1031,6 +1043,9 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_PID)
 		node_size += PID_INFO_SIZE;
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_TAG)
+		node_size += TAG_INFO_SIZE;
+
 	/* size has to be aligned to 8 bytes */
 	node_size = ALIGN(node_size, 8);
 	dev_priv->gen_pmu.buffer.node_size = node_size;
@@ -1652,6 +1667,9 @@ static int i915_gen_event_init(struct perf_event *event)
 	if (gen_attr.sample_pid)
 		dev_priv->gen_pmu.sample_info_flags |= I915_GEN_PMU_SAMPLE_PID;
 
+	if (gen_attr.sample_tag)
+		dev_priv->gen_pmu.sample_info_flags |= I915_GEN_PMU_SAMPLE_TAG;
+
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 3dcc862..db91098 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
 	__u32 size;
 	__u32 sample_ring:1,
 		sample_pid:1,
-		__reserved_1:30;
+		sample_tag:1,
+		__reserved_1:29;
 };
 
 /* Header for PERF_RECORD_DEVICE type events */
@@ -169,6 +170,11 @@ struct drm_i915_ts_node_pid {
 	__u32 pad;
 };
 
+struct drm_i915_ts_node_tag {
+	__u32 tag;
+	__u32 pad;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1


* [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (6 preceding siblings ...)
  2015-08-05  5:55 ` [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata sourab.gupta
@ 2015-08-05  5:55 ` sourab.gupta
  2015-08-05 10:03   ` Chris Wilson
  2015-08-05 20:19   ` Robert Bragg
  7 siblings, 2 replies; 31+ messages in thread
From: sourab.gupta @ 2015-08-05  5:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for retrieving MMIO register values along with
timestamps and forwarding them to userspace through perf.
Userspace can request up to 8 MMIO register values to be dumped, passing
the register addresses through the perf attr config. The addresses are
checked against a whitelist before being accepted. The commands to dump
the values of these MMIO registers are then inserted into the ring along
with the commands to dump the timestamps.

v2: Implement suggestions by Chris, pertaining to code restructuring, using
BUILD_BUG_ON etc.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 99 ++++++++++++++++++++++++++++++++++---
 include/uapi/drm/i915_drm.h         | 11 ++++-
 3 files changed, 104 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c3e823f..5c6e37a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2022,7 +2022,9 @@ struct drm_i915_private {
 #define I915_GEN_PMU_SAMPLE_RING		(1<<0)
 #define I915_GEN_PMU_SAMPLE_PID			(1<<1)
 #define I915_GEN_PMU_SAMPLE_TAG			(1<<2)
+#define I915_GEN_PMU_SAMPLE_MMIO		(1<<3)
 		int sample_info_flags;
+		u32 mmio_list[I915_PMU_MMIO_NUM];
 	} gen_pmu;
 
 	void (*emit_profiling_data[I915_PROFILE_MAX])
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e065e06..4197dbd 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -12,6 +12,7 @@
 #define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY)
 
 #define TS_DATA_SIZE sizeof(struct drm_i915_ts_data)
+#define MMIO_DATA_SIZE sizeof(struct drm_i915_mmio_data)
 #define CTX_INFO_SIZE sizeof(struct drm_i915_ts_node_ctx_id)
 #define RING_INFO_SIZE sizeof(struct drm_i915_ts_node_ring_id)
 #define PID_INFO_SIZE sizeof(struct drm_i915_ts_node_pid)
@@ -129,8 +130,8 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_object *obj = dev_priv->gen_pmu.buffer.obj;
 	struct i915_gen_pmu_node *entry;
-	u32 addr = 0;
-	int ret;
+	u32 mmio_addr, addr = 0;
+	int ret, i, count = 0;
 
 	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
 	if (entry == NULL) {
@@ -138,7 +139,12 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 		return;
 	}
 
-	ret = intel_ring_begin(ring, 6);
+	for (count = 0; count < I915_PMU_MMIO_NUM; count++) {
+		if (0 == dev_priv->gen_pmu.mmio_list[count])
+			break;
+	}
+
+	ret = intel_ring_begin(ring, 6 + 4*count);
 	if (ret) {
 		kfree(entry);
 		return;
@@ -173,6 +179,7 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 	spin_unlock(&dev_priv->gen_pmu.lock);
 
 	addr = dev_priv->gen_pmu.buffer.gtt_offset + entry->offset;
+	mmio_addr = addr + TS_DATA_SIZE;
 
 	if (ring->id == RCS) {
 		intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(5));
@@ -192,6 +199,32 @@ static void i915_gen_emit_ts_data(struct drm_i915_gem_request *req,
 		intel_ring_emit(ring, MI_NOOP);
 		intel_ring_emit(ring, MI_NOOP);
 	}
+
+	/*
+	 * Note:
+	 * 1) The optimization to store the register array with a single
+	 * command doesn't seem to be working with SRM commands. Hence, have a
+	 * loop with a single SRM command repeated. Missing anything here?
+	 * 2) This fn is presently called before and after batch buffer. As
+	 * such, there should already be the CS stall commands called after BB.
+	 * Is there a need/necessity for a command barrier to be inserted in
+	 * ring here? If so, which commands? (CS Stall?)
+	 */
+	for (i = 0; i < I915_PMU_MMIO_NUM; i++) {
+		if (0 == dev_priv->gen_pmu.mmio_list[i])
+			break;
+
+		addr = mmio_addr +
+			i * sizeof(dev_priv->gen_pmu.mmio_list[i]);
+
+		intel_ring_emit(ring,
+				MI_STORE_REGISTER_MEM(1) |
+				MI_SRM_LRM_GLOBAL_GTT);
+		intel_ring_emit(ring, dev_priv->gen_pmu.mmio_list[i]);
+		intel_ring_emit(ring, addr);
+		intel_ring_emit(ring, MI_NOOP);
+	}
+
 	intel_ring_advance(ring);
 
 	obj->base.write_domain = I915_GEM_DOMAIN_RENDER;
@@ -553,7 +586,7 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 {
 	struct perf_sample_data data;
 	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
-	int snapshot_size;
+	int snapshot_size, mmio_size;
 	u8 *snapshot, *current_ptr;
 	struct drm_i915_ts_node_ctx_id *ctx_info;
 	struct drm_i915_ts_node_ring_id *ring_info;
@@ -565,10 +598,16 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
 			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
 			(TAG_INFO_SIZE != 8));
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_MMIO)
+		mmio_size = MMIO_DATA_SIZE;
+	else
+		mmio_size = 0;
+
 	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
-	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
+	snapshot_size = TS_DATA_SIZE + mmio_size + CTX_INFO_SIZE;
 
-	ctx_info = (struct drm_i915_ts_node_ctx_id *)(snapshot + TS_DATA_SIZE);
+	ctx_info = (struct drm_i915_ts_node_ctx_id *)
+				(snapshot + mmio_size +	TS_DATA_SIZE);
 	ctx_info->ctx_id = node->ctx_id;
 	current_ptr = snapshot + snapshot_size;
 
@@ -1046,6 +1085,9 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_TAG)
 		node_size += TAG_INFO_SIZE;
 
+	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_MMIO)
+		node_size += MMIO_DATA_SIZE;
+
 	/* size has to be aligned to 8 bytes */
 	node_size = ALIGN(node_size, 8);
 	dev_priv->gen_pmu.buffer.node_size = node_size;
@@ -1641,6 +1683,40 @@ err_size:
 	goto out;
 }
 
+
+static int check_mmio_whitelist(struct drm_i915_private *dev_priv,
+				struct drm_i915_gen_pmu_attr *gen_attr)
+{
+
+#define GEN_RANGE(l, h) GENMASK(h, l)
+	static const struct register_whitelist {
+		uint64_t offset;
+		uint32_t size;
+		/* supported gens, 0x10 for 4, 0x30 for 4 and 5, etc. */
+		uint32_t gen_bitmask;
+	} whitelist[] = {
+		{ GEN6_GT_GFX_RC6, 4, GEN_RANGE(7, 9) },
+		{ GEN6_GT_GFX_RC6p, 4, GEN_RANGE(7, 9) },
+	};
+	int i, count;
+
+	for (count = 0; count < I915_PMU_MMIO_NUM; count++) {
+		if (!gen_attr->mmio_list[count])
+			break;
+
+		for (i = 0; i < ARRAY_SIZE(whitelist); i++) {
+			if (whitelist[i].offset == gen_attr->mmio_list[count] &&
+			    (1 << INTEL_INFO(dev_priv)->gen &
+					whitelist[i].gen_bitmask))
+				break;
+		}
+
+		if (i == ARRAY_SIZE(whitelist))
+			return -EINVAL;
+	}
+	return 0;
+}
+
 static int i915_gen_event_init(struct perf_event *event)
 {
 	struct drm_i915_private *dev_priv =
@@ -1670,6 +1746,17 @@ static int i915_gen_event_init(struct perf_event *event)
 	if (gen_attr.sample_tag)
 		dev_priv->gen_pmu.sample_info_flags |= I915_GEN_PMU_SAMPLE_TAG;
 
+	if (gen_attr.sample_mmio) {
+		ret = check_mmio_whitelist(dev_priv, &gen_attr);
+		if (ret)
+			return ret;
+
+		dev_priv->gen_pmu.sample_info_flags |=
+				I915_GEN_PMU_SAMPLE_MMIO;
+		memcpy(dev_priv->gen_pmu.mmio_list, gen_attr.mmio_list,
+				sizeof(dev_priv->gen_pmu.mmio_list));
+	}
+
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index db91098..4153cdf 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -81,7 +81,7 @@
 
 #define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
 
-#define I915_GEN_PMU_ATTR_SIZE_VER0	8  /* sizeof first published struct */
+#define I915_GEN_PMU_ATTR_SIZE_VER0	40  /* sizeof first published struct */
 
 typedef struct _drm_i915_oa_attr {
 	__u32 size;
@@ -105,7 +105,9 @@ struct drm_i915_gen_pmu_attr {
 	__u32 sample_ring:1,
 		sample_pid:1,
 		sample_tag:1,
-		__reserved_1:29;
+		sample_mmio:1,
+		__reserved_1:28;
+	__u32 mmio_list[8];
 };
 
 /* Header for PERF_RECORD_DEVICE type events */
@@ -155,6 +157,11 @@ struct drm_i915_ts_data {
 	__u32 ts_high;
 };
 
+struct drm_i915_mmio_data {
+#define I915_PMU_MMIO_NUM	8
+	__u32 mmio[I915_PMU_MMIO_NUM];
+};
+
 struct drm_i915_ts_node_ctx_id {
 	__u32 ctx_id;
 	__u32 pad;
-- 
1.8.5.1


* Re: [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05  5:55 ` [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata sourab.gupta
@ 2015-08-05  9:17   ` Chris Wilson
  2015-08-05  9:29     ` Daniel Vetter
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:17 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:43AM +0530, sourab.gupta@intel.com wrote:
> @@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
>  	struct drm_i915_ts_node_ctx_id *ctx_info;
>  	struct drm_i915_ts_node_ring_id *ring_info;
>  	struct drm_i915_ts_node_pid *pid_info;
> +	struct drm_i915_ts_node_tag *tag_info;
>  	struct perf_raw_record raw;
>  
>  	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
> -			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
> +			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
> +			(TAG_INFO_SIZE != 8));

This is much more useful if each clause is independent. The error
message is then unambiguous and it looks neater.

>  	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
>  	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;

> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 3dcc862..db91098 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
>  	__u32 size;
>  	__u32 sample_ring:1,
>  		sample_pid:1,
> -		__reserved_1:30;
> +		sample_tag:1,
> +		__reserved_1:29;

Start each bitfield entry on its own line with __u32;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
@ 2015-08-05  9:22   ` Chris Wilson
  2015-08-05  9:40     ` Gupta, Sourab
  2015-08-05  9:38   ` Chris Wilson
  1 sibling, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:22 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> +static void gen_buffer_destroy(struct drm_i915_private *i915)
> +{
> +	mutex_lock(&i915->dev->struct_mutex);
> +	vunmap(i915->gen_pmu.buffer.addr);
> +	i915_gem_object_ggtt_unpin(i915->gen_pmu.buffer.obj);
> +	drm_gem_object_unreference(&i915->gen_pmu.buffer.obj->base);
> +	mutex_unlock(&i915->dev->struct_mutex);
> +
> +	spin_lock(&i915->gen_pmu.lock);
> +	i915->gen_pmu.buffer.obj = NULL;
> +	i915->gen_pmu.buffer.gtt_offset = 0;
> +	i915->gen_pmu.buffer.addr = NULL;
> +	spin_unlock(&i915->gen_pmu.lock);

This ordering looks scary. At the very least it deserves a comment to
explain why it is safe.

So what stops a second event being created while the first is being
destroyed? I presume the perf events are exclusive? Or a refcounted
singleton?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf
  2015-08-05  5:55 ` [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf sourab.gupta
@ 2015-08-05  9:26   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:26 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:41AM +0530, sourab.gupta@intel.com wrote:
> @@ -542,18 +548,27 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
>  	struct perf_sample_data data;
>  	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
>  	int snapshot_size;
> -	u8 *snapshot;
> +	u8 *snapshot, *current_ptr;
>  	struct drm_i915_ts_node_ctx_id *ctx_info;
> +	struct drm_i915_ts_node_ring_id *ring_info;
>  	struct perf_raw_record raw;
>  
> -	BUILD_BUG_ON(TS_DATA_SIZE != 8);
> -	BUILD_BUG_ON(CTX_INFO_SIZE != 8);
> +	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
> +			(RING_INFO_SIZE != 8));
>  
>  	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
>  	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
>  
>  	ctx_info = (struct drm_i915_ts_node_ctx_id *)(snapshot + TS_DATA_SIZE);
>  	ctx_info->ctx_id = node->ctx_id;
> +	current_ptr = snapshot + snapshot_size;
> +
> +	if (dev_priv->gen_pmu.sample_info_flags & I915_GEN_PMU_SAMPLE_RING) {
> +		ring_info = (struct drm_i915_ts_node_ring_id *)current_ptr;
> +		ring_info->ring = node->ring;

Stylewise I would be more familiar with current_ptr = ring_info + 1, and
make current_ptr void*. snapshot_size is then redundant.

> +		snapshot_size += RING_INFO_SIZE;
> +		current_ptr = snapshot + snapshot_size;
> +	}

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05  9:17   ` Chris Wilson
@ 2015-08-05  9:29     ` Daniel Vetter
  2015-08-05 13:59       ` Robert Bragg
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Vetter @ 2015-08-05  9:29 UTC (permalink / raw)
  To: Chris Wilson, sourab.gupta, intel-gfx, Robert Bragg, Zhenyu Wang,
	Jon Bloomfield, Peter Zijlstra, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 10:17:55AM +0100, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 11:25:43AM +0530, sourab.gupta@intel.com wrote:
> > @@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
> >  	struct drm_i915_ts_node_ctx_id *ctx_info;
> >  	struct drm_i915_ts_node_ring_id *ring_info;
> >  	struct drm_i915_ts_node_pid *pid_info;
> > +	struct drm_i915_ts_node_tag *tag_info;
> >  	struct perf_raw_record raw;
> >  
> >  	BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
> > -			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
> > +			(RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
> > +			(TAG_INFO_SIZE != 8));
> 
> This is much more useful if each clause is independent. The error
> message is then unambiguous and it looks neater.
> 
> >  	snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
> >  	snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
> 
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index 3dcc862..db91098 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
> >  	__u32 size;
> >  	__u32 sample_ring:1,
> >  		sample_pid:1,
> > -		__reserved_1:30;
> > +		sample_tag:1,
> > +		__reserved_1:29;
> 
> Start each bitfield entry on its own line with __u32;

also no bitfields in uapi headers.
-Daniel

> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring
  2015-08-05  5:55 ` [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring sourab.gupta
@ 2015-08-05  9:30   ` Chris Wilson
  2015-08-05  9:54     ` Gupta, Sourab
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:30 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:40AM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch adds the routines through which one can insert commands in the
> ringbuf for capturing timestamps, which are used to insert these commands
> around the batchbuffer.
> 
> While inserting the commands, we keep a reference of associated request.
> This will be released when we are forwarding the samples to userspace
> (or when the event is being destroyed).
> Also, an active reference of the destination buffer is taken here, so that
> we can be assured that the buffer is freed up only after GPU is done with
> it, even if the local reference of the buffer is released.
> 
> v2: Changes (as suggested by Chris):
>     - Passing in 'request' struct for emit report function
>     - Removed multiple calls to i915_gem_obj_to_ggtt(). Keeping hold of
>       pinned vma from start and using when required.
>     - Better nomenclature, and error handling.
> 
> @@ -919,6 +993,7 @@ static int init_gen_pmu_buffer(struct perf_event *event)
>  	dev_priv->gen_pmu.buffer.obj = bo;
>  	dev_priv->gen_pmu.buffer.gtt_offset =
>  				i915_gem_obj_ggtt_offset(bo);
> +	dev_priv->gen_pmu.buffer.vma = i915_gem_obj_to_ggtt(bo);
>  	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
>  	INIT_LIST_HEAD(&dev_priv->gen_pmu.node_list);

Still calling i915_gem_obj_to_ggtt(bo) twice! With pmu_buffer.vma you
can drop pmu_buffer.gtt_offset and never be confused again.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
  2015-08-05  9:22   ` Chris Wilson
@ 2015-08-05  9:38   ` Chris Wilson
  2015-08-05  9:45     ` Gupta, Sourab
  1 sibling, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:38 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> The current perf PMU driver is specific for collection of OA counter
> statistics (which may be done in a periodic or asynchronous way). Since
> this enables us (and limits us) to render ring, we have no means for
> collection of data pertaining to other rings.
> 
> To overcome this limitation, we need to have a new PMU driver which enables
> data collection for other rings also (in a non-OA specific mode).
> This patch adds a new perf PMU to i915 device private, for handling
> profiling requests for non-OA counter data.This data may encompass
> timestamps, mmio register values, etc. for the relevant ring.
> The new perf PMU will serve these purposes, without constraining itself to
> type of data being dumped (which may restrict the user to specific ring
> like in case of OA counters).
> 
> The patch introduces this PMU driver alongwith its associated callbacks.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c     |   2 +
>  drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
>  drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++

You have to admit it is a bit odd for the object to be called
i915_oa_pmu/i915_gen_pmu and the file i915_oa_perf.c
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  9:22   ` Chris Wilson
@ 2015-08-05  9:40     ` Gupta, Sourab
  0 siblings, 0 replies; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05  9:40 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 09:22 +0000, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> > +static void gen_buffer_destroy(struct drm_i915_private *i915)
> > +{
> > +	mutex_lock(&i915->dev->struct_mutex);
> > +	vunmap(i915->gen_pmu.buffer.addr);
> > +	i915_gem_object_ggtt_unpin(i915->gen_pmu.buffer.obj);
> > +	drm_gem_object_unreference(&i915->gen_pmu.buffer.obj->base);
> > +	mutex_unlock(&i915->dev->struct_mutex);
> > +
> > +	spin_lock(&i915->gen_pmu.lock);
> > +	i915->gen_pmu.buffer.obj = NULL;
> > +	i915->gen_pmu.buffer.gtt_offset = 0;
> > +	i915->gen_pmu.buffer.addr = NULL;
> > +	spin_unlock(&i915->gen_pmu.lock);
> 
> This ordering looks scary. At the very least it deserves a comment to
> explain why it is safe.
> 
> So what stops a second event being created while the first is being
> destroyed? I presume the perf events are exclusive? Or a refcounted
> singleton?
> -Chris
> 
Hi Chris,

Yes, the perf events are exclusive. This patch doesn't handle the
problem of exclusion fully. I intended to handle this problem in the
later patch in the series: 
http://lists.freedesktop.org/archives/intel-gfx/2015-August/072959.html
If you check here, a new event init checks whether obj is NULL (while
holding the spinlock), to see whether it is exclusive.
This is taken care of during the event destroy work fn, which assigns
obj to NULL (while holding spinlock), after it is done with everything.

Regards,
Sourab

* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  9:38   ` Chris Wilson
@ 2015-08-05  9:45     ` Gupta, Sourab
  2015-08-05  9:49       ` Gupta, Sourab
  2015-08-05  9:56       ` Chris Wilson
  0 siblings, 2 replies; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05  9:45 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 09:38 +0000, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > The current perf PMU driver is specific for collection of OA counter
> > statistics (which may be done in a periodic or asynchronous way). Since
> > this enables us (and limits us) to render ring, we have no means for
> > collection of data pertaining to other rings.
> > 
> > To overcome this limitation, we need to have a new PMU driver which enables
> > data collection for other rings also (in a non-OA specific mode).
> > This patch adds a new perf PMU to i915 device private, for handling
> > profiling requests for non-OA counter data.This data may encompass
> > timestamps, mmio register values, etc. for the relevant ring.
> > The new perf PMU will serve these purposes, without constraining itself to
> > type of data being dumped (which may restrict the user to specific ring
> > like in case of OA counters).
> > 
> > The patch introduces this PMU driver alongwith its associated callbacks.
> > 
> > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_dma.c     |   2 +
> >  drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
> >  drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++
> 
> You have to admit it is a bit odd for the object to be called
> i915_oa_pmu/i915_gen_pmu and the file i915_oa_perf.c
> -Chris
> 

Well, yes. If the nomenclature of i915_gen_pmu is agreed upon, I can
have a new file named i915_gen_pmu.c which will hold the routines for
this pmu, leaving oa pmu stuff behind in i915_oa_pmu.c


* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  9:45     ` Gupta, Sourab
@ 2015-08-05  9:49       ` Gupta, Sourab
  2015-08-05 11:08         ` Chris Wilson
  2015-08-05  9:56       ` Chris Wilson
  1 sibling, 1 reply; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05  9:49 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 15:17 +0530, sourab gupta wrote:
> On Wed, 2015-08-05 at 09:38 +0000, Chris Wilson wrote:
> > On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > 
> > > The current perf PMU driver is specific for collection of OA counter
> > > statistics (which may be done in a periodic or asynchronous way). Since
> > > this enables us (and limits us) to render ring, we have no means for
> > > collection of data pertaining to other rings.
> > > 
> > > To overcome this limitation, we need to have a new PMU driver which enables
> > > data collection for other rings also (in a non-OA specific mode).
> > > This patch adds a new perf PMU to i915 device private, for handling
> > > profiling requests for non-OA counter data.This data may encompass
> > > timestamps, mmio register values, etc. for the relevant ring.
> > > The new perf PMU will serve these purposes, without constraining itself to
> > > type of data being dumped (which may restrict the user to specific ring
> > > like in case of OA counters).
> > > 
> > > The patch introduces this PMU driver alongwith its associated callbacks.
> > > 
> > > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_dma.c     |   2 +
> > >  drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
> > >  drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++
> > 
> > You have to admit it is a bit odd for the object to be called
> > i915_oa_pmu/i915_gen_pmu and the file i915_oa_perf.c
> > -Chris
> > 
> 
> Well, yes.. If the nomenclature of i915_gen_pmu is agreed upon, I can
> have a new file named as i915_gen_pmu.c which will hold the routines for
> this pmu, leaving oa pmu stuff behind in i915_oa_pmu.c
Sorry, I meant to say 'oa pmu stuff behind in i915_oa_perf.c'. Does
i915_gen_pmu.c sound fine?
> 


* Re: [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring
  2015-08-05  9:30   ` Chris Wilson
@ 2015-08-05  9:54     ` Gupta, Sourab
  0 siblings, 0 replies; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05  9:54 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 09:30 +0000, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 11:25:40AM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > This patch adds the routines through which one can insert commands in the
> > ringbuf for capturing timestamps, which are used to insert these commands
> > around the batchbuffer.
> > 
> > While inserting the commands, we keep a reference of associated request.
> > This will be released when we are forwarding the samples to userspace
> > (or when the event is being destroyed).
> > Also, an active reference of the destination buffer is taken here, so that
> > we can be assured that the buffer is freed up only after GPU is done with
> > it, even if the local reference of the buffer is released.
> > 
> > v2: Changes (as suggested by Chris):
> >     - Passing in 'request' struct for emit report function
> >     - Removed multiple calls to i915_gem_obj_to_ggtt(). Keeping hold of
> >       pinned vma from start and using when required.
> >     - Better nomenclature, and error handling.
> > 
> > @@ -919,6 +993,7 @@ static int init_gen_pmu_buffer(struct perf_event *event)
> >  	dev_priv->gen_pmu.buffer.obj = bo;
> >  	dev_priv->gen_pmu.buffer.gtt_offset =
> >  				i915_gem_obj_ggtt_offset(bo);
> > +	dev_priv->gen_pmu.buffer.vma = i915_gem_obj_to_ggtt(bo);
> >  	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
> >  	INIT_LIST_HEAD(&dev_priv->gen_pmu.node_list);
> 
> Still calling i915_gem_obj_to_ggtt(bo) twice! With pmu_buffer.vma you
> can drop pmu_buffer.gtt_offset and never be confused again.
> -Chris
> 
Sorry, I'll have the one pertaining to gtt offset removed, and derive
the gtt offset field from the vma only.



* Re: [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf
  2015-08-05  5:55 ` [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf sourab.gupta
@ 2015-08-05  9:55   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:55 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:38AM +0530, sourab.gupta@intel.com wrote:
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 66f9ee9..08235582 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1676,6 +1676,13 @@ struct i915_oa_rcs_node {
>  	u32 tag;
>  };
>  
> +struct i915_gen_pmu_node {
> +	struct list_head head;

This is not the head of the list, but a node.

> +	struct drm_i915_gem_request *req;

Use request since this is a less often used member name and brevity
isn't saving thousands of keystrokes.

> +	u32 offset;
> +	u32 ctx_id;
> +};

> +static int i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
> +{
> +	struct i915_gen_pmu_node *last_entry = NULL;
> +	int ret;
> +
> +	/*
> +	 * Wait for the last scheduled request to complete. This would
> +	 * implicitly wait for the prior submitted requests. The refcount
> +	 * of the requests is not decremented here.
> +	 */
> +	spin_lock(&dev_priv->gen_pmu.lock);
> +
> +	if (!list_empty(&dev_priv->gen_pmu.node_list)) {
> +		last_entry = list_last_entry(&dev_priv->gen_pmu.node_list,
> +			struct i915_gen_pmu_node, head);
> +	}
> +	spin_unlock(&dev_priv->gen_pmu.lock);

Because you issue requests on all rings, and those rings are not actually
serialised with one another, the order of writes and retirements is also
not serialised, i.e. this does not do a complete wait for all activity on
the object.

>  static void forward_gen_pmu_snapshots(struct drm_i915_private *dev_priv)
>  {
> -	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
> +	struct i915_gen_pmu_node *entry, *next;
> +	LIST_HEAD(deferred_list_free);
> +	int ret;
>  
> -	/* TODO: routine for forwarding snapshots to userspace */
> +	list_for_each_entry_safe
> +		(entry, next, &dev_priv->gen_pmu.node_list, head) {
> +		if (!i915_gem_request_completed(entry->req, true))
> +			break;

Again the list is not actually in retirement order since you combine
multiple rings into one list.

These problems magically disappear with a list per-ring and a page
per-ring. You also need to be more careful with overwriting unflushed
entries. A dynamic approach to page allocation overcomes that.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
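
The per-ring bookkeeping suggested above might be sketched roughly as follows. This is a hypothetical userspace mock-up, not the actual i915 code: the struct names, the ring count and the minimal singly-linked list are stand-ins for the driver's real list_head and ring structures.

```c
#include <stddef.h>

#define NUM_RINGS 5	/* illustrative; the real count is per-platform */

/* One snapshot node; it lives on exactly one ring's list, so the list
 * order matches that ring's submission (and hence retirement) order. */
struct gen_pmu_node {
	struct gen_pmu_node *next;
	unsigned int offset;	/* offset into that ring's snapshot page */
	unsigned int ctx_id;
	int completed;		/* stand-in for i915_gem_request_completed() */
};

/* Per-ring state: each ring gets its own list and snapshot page. */
struct gen_pmu_ring {
	struct gen_pmu_node *head;	/* oldest request */
	struct gen_pmu_node *tail;	/* newest request */
};

struct gen_pmu_state {
	struct gen_pmu_ring ring[NUM_RINGS];
};

/* Append a node, preserving per-ring submission order. */
static void gen_pmu_add_node(struct gen_pmu_ring *r, struct gen_pmu_node *node)
{
	node->next = NULL;
	if (r->tail)
		r->tail->next = node;
	else
		r->head = node;
	r->tail = node;
}

/* Forward completed snapshots for one ring, stopping at the first
 * uncompleted node on that ring only; other rings are unaffected. */
static unsigned int forward_completed(struct gen_pmu_ring *r)
{
	unsigned int n = 0;

	while (r->head && r->head->completed) {
		struct gen_pmu_node *node = r->head;

		r->head = node->next;
		if (!r->head)
			r->tail = NULL;
		node->next = NULL;
		n++;	/* the real driver would emit the sample here */
	}
	return n;
}
```

With one list per ring, the "list not in retirement order" problem cannot arise, since a single ring retires its requests in submission order.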


* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  9:45     ` Gupta, Sourab
  2015-08-05  9:49       ` Gupta, Sourab
@ 2015-08-05  9:56       ` Chris Wilson
  1 sibling, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2015-08-05  9:56 UTC (permalink / raw)
  To: Gupta, Sourab; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, Aug 05, 2015 at 09:45:28AM +0000, Gupta, Sourab wrote:
> On Wed, 2015-08-05 at 09:38 +0000, Chris Wilson wrote:
> > On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > 
> > > The current perf PMU driver is specific for collection of OA counter
> > > statistics (which may be done in a periodic or asynchronous way). Since
> > > this enables us (and limits us) to render ring, we have no means for
> > > collection of data pertaining to other rings.
> > > 
> > > To overcome this limitation, we need to have a new PMU driver which enables
> > > data collection for other rings also (in a non-OA specific mode).
> > > This patch adds a new perf PMU to i915 device private, for handling
> > > profiling requests for non-OA counter data. This data may encompass
> > > timestamps, mmio register values, etc. for the relevant ring.
> > > The new perf PMU will serve these purposes, without constraining itself to
> > > type of data being dumped (which may restrict the user to specific ring
> > > like in case of OA counters).
> > > 
> > > The patch introduces this PMU driver along with its associated callbacks.
> > > 
> > > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_dma.c     |   2 +
> > >  drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
> > >  drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++
> > 
> > You have to admit it is a bit odd for the object to be called
> > i915_oa_pmu/i915_gen_pmu and the file i915_oa_perf.c
> > -Chris
> > 
> 
Well, yes. If the nomenclature of i915_gen_pmu is agreed upon, I can
have a new file named i915_gen_pmu.c which will hold the routines for
this pmu, leaving oa pmu stuff behind in i915_oa_pmu.c

Or just i915_pmu.c if we share a lot of the routines and i915_gen_pmu,
i915_oa_pmu themselves are more or less the perf_event interface.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05  5:55 ` [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf sourab.gupta
@ 2015-08-05 10:03   ` Chris Wilson
  2015-08-05 10:18     ` Gupta, Sourab
  2015-08-05 20:19   ` Robert Bragg
  1 sibling, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05 10:03 UTC (permalink / raw)
  To: sourab.gupta; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 05, 2015 at 11:25:44AM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch adds support for retrieving MMIO register values along with
> timestamps and forwarding them to userspace through perf.
> The userspace can request up to 8 MMIO register values to be dumped.
> The addresses of up to 8 MMIO registers can be passed through perf attr
> config. The registers are checked against a whitelist before passing them
> on. The commands to dump the values of these MMIO registers are then
> inserted into the ring along with commands to dump the timestamps.

The values reported to userspace are deltas across batches right? We
don't expose the global value to an unprivileged user? It would be nice
to clarify that in perf_init so that the reviewer is aware that
the issue of an unprivileged information leak is addressed (or is at
least reminded that the register values do not leak!).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05 10:03   ` Chris Wilson
@ 2015-08-05 10:18     ` Gupta, Sourab
  2015-08-05 10:30       ` Chris Wilson
  0 siblings, 1 reply; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05 10:18 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 10:03 +0000, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 11:25:44AM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > This patch adds support for retrieving MMIO register values along with
> > timestamps and forwarding them to userspace through perf.
> > The userspace can request up to 8 MMIO register values to be dumped.
> > The addresses of up to 8 MMIO registers can be passed through perf attr
> > config. The registers are checked against a whitelist before passing them
> > on. The commands to dump the values of these MMIO registers are then
> > inserted into the ring along with commands to dump the timestamps.
> 
> The values reported to userspace are deltas across batches right? We
> don't expose the global value to an unprivileged user? It would be nice
> to clarify that in perf_init so that the reviewer is aware that
> the issue of unprivileged information leak is addressed (or at least
> reminded that the register values do not leak!).
> -Chris
> 
Hi Chris,
Two things here:
1) Only root is allowed to call event_init for gen pmu. This restriction
is there in event_init. (The thought behind this restriction being that
we are profiling data across contexts here, so a process wishing to
listen to global activity happening in the system across all contexts
ought to have root privileges). Is this thought process correct? Should
we be supporting non-root users too?

2) Being already root, do we need to worry about unauthorized mmio
access while exposing these mmio values through the interface?

In the current patches, the full mmio register value is dumped to be
passed on to userspace (no deltas across batches), provided the register
is there in the whitelist. Does the question of unprivileged information
leak arise here (the user being root)?

Regards,
Sourab

* Re: [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05 10:18     ` Gupta, Sourab
@ 2015-08-05 10:30       ` Chris Wilson
  2015-08-05 14:22         ` Gupta, Sourab
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Wilson @ 2015-08-05 10:30 UTC (permalink / raw)
  To: Gupta, Sourab; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, Aug 05, 2015 at 10:18:50AM +0000, Gupta, Sourab wrote:
> On Wed, 2015-08-05 at 10:03 +0000, Chris Wilson wrote:
> > On Wed, Aug 05, 2015 at 11:25:44AM +0530, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > 
> > > This patch adds support for retrieving MMIO register values along with
> > > timestamps and forwarding them to userspace through perf.
> > > The userspace can request up to 8 MMIO register values to be dumped.
> > > The addresses of up to 8 MMIO registers can be passed through perf attr
> > > config. The registers are checked against a whitelist before passing them
> > > on. The commands to dump the values of these MMIO registers are then
> > > inserted into the ring along with commands to dump the timestamps.
> > 
> > The values reported to userspace are deltas across batches right? We
> > don't expose the global value to an unprivileged user? It would be nice
> > to clarify that in perf_init so that the reviewer is aware that
> > the issue of unprivileged information leak is addressed (or at least
> > reminded that the register values do not leak!).
> > -Chris
> > 
> Hi Chris,
> Two things here:
> 1) Only root is allowed to call event_init for gen pmu. This restriction
> is there in event_init. (The thought behind this restriction being that
> we are profiling data across contexts here, so a process wishing to
> listen to global activity happening in system across all contexts ought
> to have root priviliges). Is this thought process correct? Should we be
> supporting non-root users too?

That is not clear in this patch, so you need to address such concerns at
least in the changelog, and preferably with a reminder in the
whitelist (that these register reads are safe because they are being
done from a privileged context only - we then have a red flag in case we
lower it).

What is the privilege check you are using here exactly?

For gen pmu, I want it user accessible. "How long does it take to execute
my batches" is a common developer query. We may even be able to make
anonymised information freely available a la top (per-process GPU usage,
memory usage, though cgroups/namespacing rules probably apply here).

> 2) Being already a root, do we need to worry about the unauthorized mmio
> access while exposing these mmio values through the interface?

Yes. See above, the information here can be anonymised and useful for
user processes exactly like TIMESTAMP.
 
> In the current patches, the full mmio register value is dumped to be
> passed on to userspace (no deltas across batches), provided the register
> is there in the whitelist. Does the question of unpriviliged information
> leak arise here(the user being root)?

Not for root.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-08-05  9:49       ` Gupta, Sourab
@ 2015-08-05 11:08         ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2015-08-05 11:08 UTC (permalink / raw)
  To: Gupta, Sourab; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, Aug 05, 2015 at 09:49:39AM +0000, Gupta, Sourab wrote:
> On Wed, 2015-08-05 at 15:17 +0530, sourab gupta wrote:
> > On Wed, 2015-08-05 at 09:38 +0000, Chris Wilson wrote:
> > > On Wed, Aug 05, 2015 at 11:25:37AM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > The current perf PMU driver is specific for collection of OA counter
> > > > statistics (which may be done in a periodic or asynchronous way). Since
> > > > this enables us (and limits us) to render ring, we have no means for
> > > > collection of data pertaining to other rings.
> > > > 
> > > > To overcome this limitation, we need to have a new PMU driver which enables
> > > > data collection for other rings also (in a non-OA specific mode).
> > > > This patch adds a new perf PMU to i915 device private, for handling
> > > > profiling requests for non-OA counter data.This data may encompass
> > > > timestamps, mmio register values, etc. for the relevant ring.
> > > > The new perf PMU will serve these purposes, without constraining itself to
> > > > type of data being dumped (which may restrict the user to specific ring
> > > > like in case of OA counters).
> > > > 
> > > > The patch introduces this PMU driver along with its associated callbacks.
> > > > 
> > > > Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_dma.c     |   2 +
> > > >  drivers/gpu/drm/i915/i915_drv.h     |  19 ++++
> > > >  drivers/gpu/drm/i915/i915_oa_perf.c | 215 ++++++++++++++++++++++++++++++++++++
> > > 
> > > You have to admit it is a bit odd for the object to be called
> > > i915_oa_pmu/i915_gen_pmu and the file i915_oa_perf.c
> > > -Chris
> > > 
> > 
> > Well, yes. If the nomenclature of i915_gen_pmu is agreed upon, I can
> > have a new file named i915_gen_pmu.c which will hold the routines for
> > this pmu, leaving oa pmu stuff behind in i915_oa_pmu.c
> Sorry, I meant to say 'oa pmu stuff behind in i915_oa_perf.c'. Does
> i915_gen_pmu.c sound fine?

Aiui, there is some common code with i915_oa_perf and a fair amount of
perf_event boilerplate. At the moment, I'm leaning towards just
i915_pmu.c for both (and if need be i915_pmu_gen.c and i915_pmu_oa.c, if
and only if splitting the boilerplate away helps maintainability).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05  9:29     ` Daniel Vetter
@ 2015-08-05 13:59       ` Robert Bragg
  2015-08-05 15:25         ` Daniel Vetter
  0 siblings, 1 reply; 31+ messages in thread
From: Robert Bragg @ 2015-08-05 13:59 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Insoo Woo, Peter Zijlstra, intel-gfx, Jabin Wu, Gupta, Sourab

On Wed, Aug 5, 2015 at 10:29 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Aug 05, 2015 at 10:17:55AM +0100, Chris Wilson wrote:
>> On Wed, Aug 05, 2015 at 11:25:43AM +0530, sourab.gupta@intel.com wrote:
>> > @@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
>> >     struct drm_i915_ts_node_ctx_id *ctx_info;
>> >     struct drm_i915_ts_node_ring_id *ring_info;
>> >     struct drm_i915_ts_node_pid *pid_info;
>> > +   struct drm_i915_ts_node_tag *tag_info;
>> >     struct perf_raw_record raw;
>> >
>> >     BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
>> > -                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
>> > +                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
>> > +                   (TAG_INFO_SIZE != 8));
>>
>> This is much more useful if each clause is independent. The error
>> message is then unambiguous and it looks neater.
>>
>> >     snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
>> >     snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
>>
>> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> > index 3dcc862..db91098 100644
>> > --- a/include/uapi/drm/i915_drm.h
>> > +++ b/include/uapi/drm/i915_drm.h
>> > @@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
>> >     __u32 size;
>> >     __u32 sample_ring:1,
>> >             sample_pid:1,
>> > -           __reserved_1:30;
>> > +           sample_tag:1,
>> > +           __reserved_1:29;
>>
>> Start each bitfield entry on its own line with __u32;
>
> also no bitfields in uapi headers.
> -Daniel

Ah, I had previously asked Sourab to pack the bitfields into the same
u64. I think we only get into undefined ABI territory if we have
multiple sequential bitfields in the structure where the compiler can
choose to combine them in some undefined way?

This follows the same pattern for bitfields seen in struct perf_event_attr.

I'm not sure we'll need lots of flags in our case though so perhaps it
would be fine to avoid the use of bitfields altogether here.

- Robert
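
The split Chris suggests for the combined BUILD_BUG_ON might look as follows. This is a userspace sketch: BUILD_BUG_ON is re-created here with the usual negative-array-size trick, and the size constants are placeholders for the driver's real sizeof() checks.

```c
/* Userspace stand-in for the kernel's BUILD_BUG_ON(): compilation
 * fails on the exact line whose condition is true. */
#define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

/* Placeholder sizes standing in for the sample node structs. */
#define TS_DATA_SIZE   8
#define CTX_INFO_SIZE  8
#define RING_INFO_SIZE 8
#define PID_INFO_SIZE  8
#define TAG_INFO_SIZE  8

static void check_sample_layout(void)
{
	/* One clause per assertion: a failure now names the exact
	 * offending size instead of the whole combined expression. */
	BUILD_BUG_ON(TS_DATA_SIZE != 8);
	BUILD_BUG_ON(CTX_INFO_SIZE != 8);
	BUILD_BUG_ON(RING_INFO_SIZE != 8);
	BUILD_BUG_ON(PID_INFO_SIZE != 8);
	BUILD_BUG_ON(TAG_INFO_SIZE != 8);
}
```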


* Re: [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05 10:30       ` Chris Wilson
@ 2015-08-05 14:22         ` Gupta, Sourab
  0 siblings, 0 replies; 31+ messages in thread
From: Gupta, Sourab @ 2015-08-05 14:22 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo

On Wed, 2015-08-05 at 10:30 +0000, Chris Wilson wrote:
> On Wed, Aug 05, 2015 at 10:18:50AM +0000, Gupta, Sourab wrote:
> > On Wed, 2015-08-05 at 10:03 +0000, Chris Wilson wrote:
> > > On Wed, Aug 05, 2015 at 11:25:44AM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > This patch adds support for retrieving MMIO register values along with
> > > > timestamps and forwarding them to userspace through perf.
> > > > The userspace can request up to 8 MMIO register values to be dumped.
> > > > The addresses of up to 8 MMIO registers can be passed through perf attr
> > > > config. The registers are checked against a whitelist before passing them
> > > > on. The commands to dump the values of these MMIO registers are then
> > > > inserted into the ring along with commands to dump the timestamps.
> > > 
> > > The values reported to userspace are deltas across batches right? We
> > > don't expose the global value to an unprivileged user? It would be nice
> > > to clarify that in perf_init so that the reviewer is aware that
> > > the issue of unprivileged information leak is addressed (or at least
> > > reminded that the register values do not leak!).
> > > -Chris
> > > 
> > Hi Chris,
> > Two things here:
> > 1) Only root is allowed to call event_init for gen pmu. This restriction
> > is there in event_init. (The thought behind this restriction being that
> > we are profiling data across contexts here, so a process wishing to
> > listen to global activity happening in system across all contexts ought
> > to have root priviliges). Is this thought process correct? Should we be
> > supporting non-root users too?
> 
> That is not clear in this patch, so you need to address such concerns at
> least in the changelog, and preferrably with a reminder in the
> whitelist (that these register reads are safe because they are being
> done from a privileged context only - we then have a red flag in case we
> lower it).
> 
> What is the privilege check you are using here exactly?

In the current patch set, during the gen pmu event_init, I'm checking
for root access using the below check:
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;

> 
> For gen pmu, I want it user accessible. How long does it take to execute
> my batches is a common developer query. We may even be able to make
> anonymised information freely available ala top (per-process GPU usage,
> memory usage, though cgroups/namespacing rules probably apply here).
> 

So, aiui the privilege access should be controlled as below:
- For gen pmu, no need to restrict only to root processes. This would
imply that user processes would now be able to gather timestamps for all
the batches (no unprivileged information leak, since timestamps are
inherently anonymised) ..

- For the collection of mmio register data, we have the following options:
    - If it is a root process, allow access (is the whitelist check
necessary in this case?).
    - If not root, one option is to disallow the mmio register dump
(probably not a preferable option?).
    - If not root, a second option is to allow the mmio dump (after
checking against the whitelist?). In this case, do we send the mmio
register values as they exist or do we anonymise them? My impression was
that perf is expected to simply return the dump of the mmio registers
requested, and throw an access error in case of an unprivileged
operation. And if required, how do we anonymise the mmio data?

Can you let me know your opinion here wrt the above points, and the
mechanism to anonymise the mmio data.

> > 2) Being already a root, do we need to worry about the unauthorized mmio
> > access while exposing these mmio values through the interface?
> 
> Yes. See above, the information here can be anonymised and useful for
> user processes exactly like TIMESTAMP.
>  
> > In the current patches, the full mmio register value is dumped to be
> > passed on to userspace (no deltas across batches), provided the register
> > is there in the whitelist. Does the question of unpriviliged information
> > leak arise here(the user being root)?
> 
> Not for root.
> -Chris
> 
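
The privilege split being discussed — anonymised timestamps for any user, whitelisted MMIO reads only from a privileged context — might gate roughly as below. This is a hypothetical userspace sketch: capable_sys_admin() stands in for the kernel's capable(CAP_SYS_ADMIN), and the whitelist offsets are placeholders, not real i915 registers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's capable(CAP_SYS_ADMIN). */
static int event_is_privileged;

static int capable_sys_admin(void)
{
	return event_is_privileged;
}

/* Illustrative whitelist of registers deemed safe to expose; these
 * offsets are placeholders. */
static const uint32_t mmio_whitelist[] = { 0x2358, 0x235c };

static int mmio_whitelisted(uint32_t reg)
{
	size_t i;

	for (i = 0; i < sizeof(mmio_whitelist) / sizeof(mmio_whitelist[0]); i++)
		if (mmio_whitelist[i] == reg)
			return 1;
	return 0;
}

/* Timestamp-only sampling is inherently anonymised, so any user may
 * open it; raw MMIO values are gated on a privileged context, and even
 * then only whitelisted registers are accepted. */
static int gen_pmu_event_init(int want_mmio, uint32_t reg)
{
	if (!want_mmio)
		return 0;
	if (!capable_sys_admin())
		return -EACCES;
	if (!mmio_whitelisted(reg))
		return -EINVAL;
	return 0;
}
```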


* Re: [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05 13:59       ` Robert Bragg
@ 2015-08-05 15:25         ` Daniel Vetter
  2015-08-05 16:48           ` Robert Bragg
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Vetter @ 2015-08-05 15:25 UTC (permalink / raw)
  To: Robert Bragg
  Cc: Insoo Woo, Peter Zijlstra, intel-gfx, Jabin Wu, Gupta, Sourab

On Wed, Aug 05, 2015 at 02:59:03PM +0100, Robert Bragg wrote:
> On Wed, Aug 5, 2015 at 10:29 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, Aug 05, 2015 at 10:17:55AM +0100, Chris Wilson wrote:
> >> On Wed, Aug 05, 2015 at 11:25:43AM +0530, sourab.gupta@intel.com wrote:
> >> > @@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
> >> >     struct drm_i915_ts_node_ctx_id *ctx_info;
> >> >     struct drm_i915_ts_node_ring_id *ring_info;
> >> >     struct drm_i915_ts_node_pid *pid_info;
> >> > +   struct drm_i915_ts_node_tag *tag_info;
> >> >     struct perf_raw_record raw;
> >> >
> >> >     BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
> >> > -                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
> >> > +                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
> >> > +                   (TAG_INFO_SIZE != 8));
> >>
> >> This is much more useful if each clause is independent. The error
> >> message is then unambiguous and it looks neater.
> >>
> >> >     snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
> >> >     snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
> >>
> >> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> >> > index 3dcc862..db91098 100644
> >> > --- a/include/uapi/drm/i915_drm.h
> >> > +++ b/include/uapi/drm/i915_drm.h
> >> > @@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
> >> >     __u32 size;
> >> >     __u32 sample_ring:1,
> >> >             sample_pid:1,
> >> > -           __reserved_1:30;
> >> > +           sample_tag:1,
> >> > +           __reserved_1:29;
> >>
> >> Start each bitfield entry on its own line with __u32;
> >
> > also no bitfields in uapi headers.
> > -Daniel
> 
> Ah, I had previously asked Sourab to pack the bitfields into the same
> u64. I think we only get into undefined ABI territory if we have
> multiple sequential bitfields in the structure where the compiler can
> choose to combine them in some undefined way?
> 
> This follows the same pattern for bitfields seen in struct perf_event_attr.
> 
> I'm not sure we'll need lots of flags in our case though so perhaps it
> would be fine to avoid the use of bitfields altogether here.

It might be uapi cargo culting, but I'm just not sure ;-) The other
problem with bitfields is that it's fickle to properly size the reserved
fields, and we need those to correctly reject unused flags. Otherwise
userspace might put garbage in there and extendability is out the window.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata
  2015-08-05 15:25         ` Daniel Vetter
@ 2015-08-05 16:48           ` Robert Bragg
  0 siblings, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-08-05 16:48 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Insoo Woo, Peter Zijlstra, intel-gfx, Jabin Wu, Gupta, Sourab

On Wed, Aug 5, 2015 at 4:25 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Aug 05, 2015 at 02:59:03PM +0100, Robert Bragg wrote:
>> On Wed, Aug 5, 2015 at 10:29 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> > On Wed, Aug 05, 2015 at 10:17:55AM +0100, Chris Wilson wrote:
>> >> On Wed, Aug 05, 2015 at 11:25:43AM +0530, sourab.gupta@intel.com wrote:
>> >> > @@ -555,10 +558,12 @@ static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
>> >> >     struct drm_i915_ts_node_ctx_id *ctx_info;
>> >> >     struct drm_i915_ts_node_ring_id *ring_info;
>> >> >     struct drm_i915_ts_node_pid *pid_info;
>> >> > +   struct drm_i915_ts_node_tag *tag_info;
>> >> >     struct perf_raw_record raw;
>> >> >
>> >> >     BUILD_BUG_ON((TS_DATA_SIZE != 8) || (CTX_INFO_SIZE != 8) ||
>> >> > -                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8));
>> >> > +                   (RING_INFO_SIZE != 8) || (PID_INFO_SIZE != 8) ||
>> >> > +                   (TAG_INFO_SIZE != 8));
>> >>
>> >> This is much more useful if each clause is independent. The error
>> >> message is then unambiguous and it looks neater.
>> >>
>> >> >     snapshot = dev_priv->gen_pmu.buffer.addr + node->offset;
>> >> >     snapshot_size = TS_DATA_SIZE + CTX_INFO_SIZE;
>> >>
>> >> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> >> > index 3dcc862..db91098 100644
>> >> > --- a/include/uapi/drm/i915_drm.h
>> >> > +++ b/include/uapi/drm/i915_drm.h
>> >> > @@ -104,7 +104,8 @@ struct drm_i915_gen_pmu_attr {
>> >> >     __u32 size;
>> >> >     __u32 sample_ring:1,
>> >> >             sample_pid:1,
>> >> > -           __reserved_1:30;
>> >> > +           sample_tag:1,
>> >> > +           __reserved_1:29;
>> >>
>> >> Start each bitfield entry on its own line with __u32;
>> >
>> > also no bitfields in uapi headers.
>> > -Daniel
>>
>> Ah, I had previously asked Sourab to pack the bitfields into the same
>> u64. I think we only get into undefined ABI territory if we have
>> multiple sequential bitfields in the structure where the compiler can
>> choose to combine them in some undefined way?
>>
>> This follows the same pattern for bitfields seen in struct perf_event_attr.
>>
>> I'm not sure we'll need lots of flags in our case though so perhaps it
>> would be fine to avoid the use of bitfields altogether here.
>
> It might be uapi cargo culting, but I'm just not sure ;-) The other
> problem with bitfields is that it's fickle to properly size the reserved
> fields, and we need those to correctly reject unused flags. Otherwise
> userspace might put garbage in there and extendability is out the window.

In my latest branch (sorry I haven't sent out a recent RFC myself as
I'm hoping to update public Gen Observability docs before I do that) I
ended up slightly generalizing and exporting perf_copy_attr() in
kernel/events/core.c to use the same tested code to help with this.
Core perf's approach to versioning + extending the attributes
structure seems pretty decent.

That said, regarding unused/reserved fields, I realise now I did
miss an important check within the i915_oa code that core perf has,
which is to explicitly return -EINVAL if __reserved_1 != 0.

Maybe that should be taken as a case in point.

- Robert
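
A bitfield-free attr layout with the reserved-field check Robert describes might look like this. The struct and flag names here are illustrative, not the proposed uapi; the validation mirrors what core perf's attr copying does.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative sample flags: a plain flags word instead of bitfields,
 * so the ABI does not depend on compiler bitfield packing. */
#define I915_GEN_PMU_SAMPLE_RING (1u << 0)
#define I915_GEN_PMU_SAMPLE_PID  (1u << 1)
#define I915_GEN_PMU_SAMPLE_TAG  (1u << 2)

#define I915_GEN_PMU_SAMPLE_KNOWN_FLAGS \
	(I915_GEN_PMU_SAMPLE_RING | I915_GEN_PMU_SAMPLE_PID | \
	 I915_GEN_PMU_SAMPLE_TAG)

struct gen_pmu_attr {
	uint32_t size;		/* for perf-style versioned extension */
	uint32_t sample_flags;
	uint64_t __reserved_1;	/* must be zero */
};

/* Reject anything userspace must leave zero, so those bits stay
 * available for future extension (cf. perf_copy_attr()). */
static int gen_pmu_validate_attr(const struct gen_pmu_attr *attr)
{
	if (attr->sample_flags & ~I915_GEN_PMU_SAMPLE_KNOWN_FLAGS)
		return -EINVAL;
	if (attr->__reserved_1 != 0)
		return -EINVAL;
	return 0;
}
```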


* Re: [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf
  2015-08-05  5:55 ` [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf sourab.gupta
  2015-08-05 10:03   ` Chris Wilson
@ 2015-08-05 20:19   ` Robert Bragg
  1 sibling, 0 replies; 31+ messages in thread
From: Robert Bragg @ 2015-08-05 20:19 UTC (permalink / raw)
  To: Gupta, Sourab; +Cc: Peter Zijlstra, intel-gfx, Jabin Wu, Insoo Woo

On Wed, Aug 5, 2015 at 6:55 AM,  <sourab.gupta@intel.com> wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> This patch adds support for retrieving MMIO register values along with
> timestamps and forwarding them to userspace through perf.
> The userspace can request up to 8 MMIO register values to be dumped.
> The addresses of up to 8 MMIO registers can be passed through perf attr
> config. The registers are checked against a whitelist before passing them
> on. The commands to dump the values of these MMIO registers are then
> inserted into the ring along with commands to dump the timestamps.

Considering the discussion had so far with Peter: one thing raised was
a preference for exposing individual counters via separate events. In
the case of OA metrics I don't think that's at all as straightforward
as it sounds, due to the way the OA unit is configured and reports
counters, but for mmio-based counters the configurations are completely
orthogonal (just an address), so I don't know that there's a need to
configure multiple reads per event, and I imagine we should be able to
avoid the arbitrary limit of 8 reads.

Perf allows users to group event fds together which signifies to the
kernel that it wants the counters to be reported in the same buffer
(the buffer of the group leader).

A more extensible list of registers that should be read via the SRM
commands could be indirectly derived by maintaining a list of the
active mmio-read events.

I think something else to raise here is that it could help if we had
some more concrete use cases and at least some prototype userspace
code for this interface. I guess the requirements around privileges
could depend a bit on what specific registers you're interested in.

If security requirements may vary for different counters I do also
wonder if instead of a generic mmio event it might be appropriate to
enumerate what we're interested in and have a separate event for each
specific counter considering requirements on a case by case basis.

I wonder if we should also consider exposing 64-bit counters such as
the pipeline statistics here. intel_gpu_top tries to expose pipeline
statistics, but one problem it faces is that these are per-context
counters, so it would be better to read them via the command stream
with a mechanism like this instead of periodically, so that the reads
can be reliably mapped to a context.

In general a mechanism like this could be a good fit for exposing
per-context metrics to a system compositor (metrics not well suited to
periodic sampling).

- Robert
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC 0/8] Introduce framework for forwarding generic non-OA performance
@ 2015-07-15  8:51 sourab.gupta
  0 siblings, 0 replies; 31+ messages in thread
From: sourab.gupta @ 2015-07-15  8:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This is an updated patch set (changes list at end), which builds upon the
multi context OA patch set introduced earlier at:

http://lists.freedesktop.org/archives/intel-gfx/2015-July/071697.html

The OA unit, as such, is specific to the render ring and can't cater to
performance data requirements for other GPU engines.
Specifically, media workloads may utilize other GPU engines, but there is
currently no framework which can be used to query performance statistics for
non-RCS workloads and provide this data to userspace tools. This patch set
tries to address this specific problem. The aim of this patch series is to
build upon the perf event framework developed earlier and use it for
forwarding performance data of non-RCS engine workloads.

Since the previous PMU is customized to handle OA reports, a new perf PMU is
added to handle generic non-OA performance data. Examples of such non-OA
performance data are timestamps and MMIO register values.
This patch set enables a framework for capturing timestamps at
batch buffer boundaries, by inserting the corresponding commands into the
ringbuffer, and forwarding the samples to userspace through the perf
interface. The framework and data structures can nevertheless be extended to
introduce more performance data types (other than timestamps). The intention
here is to introduce a framework to enable capturing of generic performance
data and forwarding it to userspace using the perf APIs.

The reports generated will again have an additional footer for metadata
information such as ctx_id, pid, ring id and tags (in the same way as done
for the OA reports specified in the earlier patch series). This information
can be used by userspace tools such as MVP (Modular Video Profiler) to
associate reports with individual contexts and different stages of workload
execution.

In this patch set, the timestamps are captured at BB boundaries by inserting
the commands into the ringbuffer at the batchbuffer boundaries. As stated
earlier, for a system-wide GPU profiler, the relative complexity of doing this
in the kernel is significantly less than supporting this usecase through
userspace command insertion by all the different components.

The final patch in the series extends the data structures to enable
capture of up to 8 MMIO register values in conjunction with timestamps.

v2: This patch series has the following changes wrt the one floated earlier:
    - Removing synchronous waits during event stop/destroy
    - Segregating the book-keeping data for the samples from the destination
      buffer and collecting it into a separate list
    - Managing the lifetime of the destination buffer with the help of gem
      active reference tracking
    - Limiting the scope of the i915 device mutex to places of gem interaction
      and protecting the pmu data structures with a per-pmu lock
    - Userspace can now control the metadata it wants by requesting it during
      event init. The sample is sent with the requested metadata in a packed
      format.
    - Some patches merged together and a few more introduced
    - MMIO whitelist in place

Sourab Gupta (8):
  drm/i915: Add a new PMU for handling non-OA counter data profiling
    requests
  drm/i915: Add mechanism for forwarding the timestamp data through perf
  drm/i915: Handle event stop and destroy for GPU commands submitted
  drm/i915: Insert commands for capturing timestamps in the ring
  drm/i915: Add support for forwarding ring id in sample metadata
    through perf
  drm/i915: Add support for forwarding pid in timestamp sample metadata
    through perf
  drm/i915: Add support for forwarding execbuffer tags in timestamp
    sample metadata
  drm/i915: Support for retrieving MMIO register values alongwith
    timestamps through perf

 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  41 +++
 drivers/gpu/drm/i915/i915_oa_perf.c | 680 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |   2 +
 include/uapi/drm/i915_drm.h         |  41 +++
 5 files changed, 766 insertions(+)

-- 
1.8.5.1


end of thread, other threads:[~2015-08-05 20:19 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-05  5:55 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
2015-08-05  5:55 ` [RFC 1/8] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
2015-08-05  9:22   ` Chris Wilson
2015-08-05  9:40     ` Gupta, Sourab
2015-08-05  9:38   ` Chris Wilson
2015-08-05  9:45     ` Gupta, Sourab
2015-08-05  9:49       ` Gupta, Sourab
2015-08-05 11:08         ` Chris Wilson
2015-08-05  9:56       ` Chris Wilson
2015-08-05  5:55 ` [RFC 2/8] drm/i915: Add mechanism for forwarding the timestamp data through perf sourab.gupta
2015-08-05  9:55   ` Chris Wilson
2015-08-05  5:55 ` [RFC 3/8] drm/i915: Handle event stop and destroy for GPU commands submitted sourab.gupta
2015-08-05  5:55 ` [RFC 4/8] drm/i915: Insert commands for capturing timestamps in the ring sourab.gupta
2015-08-05  9:30   ` Chris Wilson
2015-08-05  9:54     ` Gupta, Sourab
2015-08-05  5:55 ` [RFC 5/8] drm/i915: Add support for forwarding ring id in sample metadata through perf sourab.gupta
2015-08-05  9:26   ` Chris Wilson
2015-08-05  5:55 ` [RFC 6/8] drm/i915: Add support for forwarding pid in timestamp " sourab.gupta
2015-08-05  5:55 ` [RFC 7/8] drm/i915: Add support for forwarding execbuffer tags in timestamp sample metadata sourab.gupta
2015-08-05  9:17   ` Chris Wilson
2015-08-05  9:29     ` Daniel Vetter
2015-08-05 13:59       ` Robert Bragg
2015-08-05 15:25         ` Daniel Vetter
2015-08-05 16:48           ` Robert Bragg
2015-08-05  5:55 ` [RFC 8/8] drm/i915: Support for retrieving MMIO register values alongwith timestamps through perf sourab.gupta
2015-08-05 10:03   ` Chris Wilson
2015-08-05 10:18     ` Gupta, Sourab
2015-08-05 10:30       ` Chris Wilson
2015-08-05 14:22         ` Gupta, Sourab
2015-08-05 20:19   ` Robert Bragg
  -- strict thread matches above, loose matches on Subject: below --
2015-07-15  8:51 [RFC 0/8] Introduce framework for forwarding generic non-OA performance sourab.gupta
