* [RFC 0/7] Introduce framework for forwarding generic non-OA performance
@ 2015-06-22  9:55 sourab.gupta
  2015-06-22  9:55 ` [RFC 1/7] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

Cc: Robert Bragg <robert@sixbynine.org>,
    Zhenyu Wang <zhenyuw@linux.intel.com>,
    Jon Bloomfield <jon.bloomfield@intel.com>,
    Peter Zijlstra <a.p.zijlstra@chello.nl>,
    Jabin Wu <jabin.wu@intel.com>,
    Insoo Woo <insoo.woo@intel.com>


This patch series builds upon the initial patch set floated earlier, which
extends the periodic OA sampling framework to also handle asynchronous OA
counter data and forward the samples using perf. That series can be seen at:

http://lists.freedesktop.org/archives/intel-gfx/2015-June/069263.html

The OA unit, as such, is specific to the render ring and can't cater to the
performance data requirements of other GPU engines.
In particular, media workloads may utilize other GPU engines, but there is
currently no framework which can be used to query performance statistics for
non-RCS workloads and provide this data to userspace tools. This patch set
addresses that problem: it builds upon the perf event framework developed
earlier and uses it to forward performance data for non-RCS engine workloads.

Since the previous PMU is customized to handle OA reports, a new perf PMU is
added to handle generic non-OA performance data. An example of such non-OA
performance data is the timestamps captured at asynchronous points during
workload execution.
This patch set makes this concrete by capturing the timestamps at batch
buffer boundaries (by inserting the corresponding commands into the
ringbuffer) and forwarding the samples to userspace through the perf
interface.
Nevertheless, the framework and data structures can be extended to introduce
more performance data types (other than timestamps) and to capture them at
other points of workload execution. The intention here is to introduce a
framework for capturing generic performance data and forwarding it to
userspace using the perf APIs.
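
For illustration, a userspace tool could open an event on this PMU roughly as
follows (a minimal sketch only, assuming the PMU is registered under the name
"i915_gen" as done later in this series; error handling trimmed):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_i915_gen_event(void)
{
	struct perf_event_attr attr;
	FILE *f;
	int type;

	/* dynamically registered PMUs export their type id in sysfs */
	f = fopen("/sys/bus/event_source/devices/i915_gen/type", "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.sample_type = PERF_SAMPLE_RAW;	/* samples carry the raw node */
	attr.sample_period = 1;

	/* system wide: pid == -1 plus a cpu, since the PMU has no task context */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}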

The reports generated again carry an additional footer with metadata such as
ctx_id, pid, ring id and tags (in the same way as done for the OA reports in
the earlier patch series). This information can be used by userspace tools
such as MVP (Modular Video Profiler) to associate reports with individual
contexts and with different stages of workload execution.
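
As a rough sketch of how such a tool might interpret one sample (assuming the
event was opened with only PERF_SAMPLE_RAW set, so the record body is a u32
size followed by the node, and using the drm_i915_ts_usernode layout from
patch 3; the final patch extends this layout with the MMIO values):

#include <stdint.h>
#include <stdio.h>
#include <linux/perf_event.h>
#include <drm/i915_drm.h>	/* needs the uapi additions from this series */

struct i915_gen_raw_sample {
	struct perf_event_header header;
	uint32_t size;				/* size of the raw payload */
	struct drm_i915_ts_usernode node;	/* timestamp + metadata footer */
};

static void print_one_sample(const struct i915_gen_raw_sample *s)
{
	const struct drm_i915_ts_node_footer *meta = &s->node.node_info;
	uint64_t ts = ((uint64_t)s->node.timestamp.ts_high << 32) |
		      s->node.timestamp.ts_low;

	printf("ring %u pid %u ctx %u perftag %u ts %llu\n",
	       meta->ring, meta->pid, meta->ctx_id, meta->perftag,
	       (unsigned long long)ts);
}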

In this patch set, the timestamps are captured at batchbuffer boundaries by
inserting the corresponding commands into the ringbuffer. As noted earlier,
for a system-wide GPU profiler the relative complexity of doing this in the
kernel is significantly less than supporting this use case through userspace
command insertion by all the different components.

The final patch in the series extends the data structures to enable capture
of up to 8 MMIO register values, in conjunction with the timestamps.


Sourab Gupta (7):
  drm/i915: Add a new PMU for handling non-OA counter data profiling
    requests
  drm/i915: Register routines for Gen perf PMU driver
  drm/i915: Introduce timestamp node for timestamp data collection
  drm/i915: Add mechanism for forwarding the data samples to userspace
    through Gen PMU perf interface
  drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  drm/i915: Add routines for inserting commands in the ringbuf for
    capturing timestamps
  drm/i915: Add support for retrieving MMIO register values in Gen Perf
    PMU

 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  47 +++
 drivers/gpu/drm/i915/i915_oa_perf.c | 579 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |   2 +
 include/uapi/drm/i915_drm.h         |  25 ++
 5 files changed, 655 insertions(+)

-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 1/7] drm/i915: Add a new PMU for handling non-OA counter data profiling requests
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22  9:55 ` [RFC 2/7] drm/i915: Register routines for Gen perf PMU driver sourab.gupta
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

The current perf PMU driver is specific to the collection of OA counter
statistics (which may be done in a periodic or asynchronous way). Since this
ties us (and limits us) to the render ring, we have no means of collecting
data pertaining to other rings.

To overcome this limitation, we need a new PMU driver which enables data
collection for the other rings as well (in a non-OA-specific mode).
This patch adds a new perf PMU to the i915 device private, for handling
profiling requests for non-OA counter data. This data may encompass
timestamps, MMIO register values, etc. for the relevant ring.
The new perf PMU will serve these purposes without constraining itself to the
type of data being dumped (which may constrain the user to a specific ring,
as in the case of OA counters).

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 758d924..b8b5455 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1985,6 +1985,22 @@ struct drm_i915_private {
 		struct completion complete;
 	} oa_pmu;
 
+	struct {
+		struct pmu pmu;
+		spinlock_t lock;
+		struct hrtimer timer;
+		struct pt_regs dummy_regs;
+		struct perf_event *exclusive_event;
+		bool event_active;
+
+		struct {
+			struct drm_i915_gem_object *obj;
+			u8 *addr;
+			u32 head;
+			u32 tail;
+		} buffer;
+	} gen_pmu;
+
 	struct list_head profile_cmd;
 #endif
 
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 2/7] drm/i915: Register routines for Gen perf PMU driver
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
  2015-06-22  9:55 ` [RFC 1/7] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22  9:55 ` [RFC 3/7] drm/i915: Introduce timestamp node for timestamp data collection sourab.gupta
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch registers the new PMU driver, whose purpose is to enable
collection of non-OA counter data for all the rings in a generic way.
The patch introduces the routines for this PMU driver, including the
allocation routines for the buffer into which the data is collected.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |   4 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 225 ++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index f12feaa..a53aa04 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -823,6 +823,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	 * spinlock, upsetting lockdep checks */
 	INIT_LIST_HEAD(&dev_priv->profile_cmd);
 	i915_oa_pmu_register(dev);
+	i915_gen_pmu_register(dev);
 
 	intel_pm_setup(dev);
 
@@ -1073,6 +1074,7 @@ int i915_driver_unload(struct drm_device *dev)
 		return ret;
 	}
 
+	i915_gen_pmu_unregister(dev);
 	i915_oa_pmu_unregister(dev);
 	intel_power_domains_fini(dev_priv);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b8b5455..3491584 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3305,10 +3305,14 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
 /* i915_oa_perf.c */
 #ifdef CONFIG_PERF_EVENTS
 extern void i915_oa_pmu_register(struct drm_device *dev);
+extern void i915_gen_pmu_register(struct drm_device *dev);
 extern void i915_oa_pmu_unregister(struct drm_device *dev);
+extern void i915_gen_pmu_unregister(struct drm_device *dev);
 #else
 static inline void i915_oa_pmu_register(struct drm_device *dev) {}
+static inline void i915_gen_pmu_register(struct drm_device *dev) {}
 static inline void i915_oa_pmu_unregister(struct drm_device *dev) {}
+static inline void i915_gen_pmu_unregister(struct drm_device *dev) {}
 #endif
 
 /* i915_suspend.c */
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index ab419d9..e2042b6 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -224,6 +224,13 @@ void forward_oa_async_snapshots_work(struct work_struct *__work)
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
+static void gen_pmu_flush_snapshots(struct drm_i915_private *dev_priv)
+{
+	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
+
+	/* TODO: routine for forwarding snapshots to userspace */
+}
+
 static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
 					     u8 *snapshot,
 					     struct perf_event *event)
@@ -443,6 +450,33 @@ static void i915_oa_event_destroy(struct perf_event *event)
 	intel_runtime_pm_put(dev_priv);
 }
 
+static void gen_buffer_destroy(struct drm_i915_private *i915)
+{
+	mutex_lock(&i915->dev->struct_mutex);
+
+	vunmap(i915->gen_pmu.buffer.addr);
+	i915_gem_object_ggtt_unpin(i915->gen_pmu.buffer.obj);
+	drm_gem_object_unreference(&i915->gen_pmu.buffer.obj->base);
+
+	i915->gen_pmu.buffer.obj = NULL;
+	i915->gen_pmu.buffer.addr = NULL;
+
+	mutex_unlock(&i915->dev->struct_mutex);
+}
+
+static void i915_gen_event_destroy(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	WARN_ON(event->parent);
+
+	gen_buffer_destroy(i915);
+
+	BUG_ON(i915->gen_pmu.exclusive_event != event);
+	i915->gen_pmu.exclusive_event = NULL;
+}
+
 static void *vmap_oa_buffer(struct drm_i915_gem_object *obj)
 {
 	int i;
@@ -599,6 +633,32 @@ unlock:
 	return ret;
 }
 
+static int init_gen_pmu_buffer(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	struct drm_i915_gem_object *bo;
+	int ret;
+
+	BUG_ON(!IS_HASWELL(dev_priv->dev) && !IS_VALLEYVIEW(dev_priv->dev) &&
+		 !IS_BROADWELL(dev_priv->dev));
+	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
+	BUG_ON(dev_priv->gen_pmu.buffer.obj);
+
+	ret = alloc_oa_obj(dev_priv, &bo);
+	if (ret)
+		return ret;
+
+	dev_priv->gen_pmu.buffer.obj = bo;
+
+	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
+
+	DRM_DEBUG_DRIVER("Gen PMU Buffer initialized, vaddr = %p",
+			 dev_priv->gen_pmu.buffer.addr);
+
+	return 0;
+}
+
 static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
 {
 	struct drm_i915_private *i915 =
@@ -610,6 +670,17 @@ static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
 	return HRTIMER_RESTART;
 }
 
+static enum hrtimer_restart hrtimer_sample_gen(struct hrtimer *hrtimer)
+{
+	struct drm_i915_private *i915 =
+		container_of(hrtimer, typeof(*i915), gen_pmu.timer);
+
+	gen_pmu_flush_snapshots(i915);
+
+	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
+	return HRTIMER_RESTART;
+}
+
 static struct intel_context *
 lookup_context(struct drm_i915_private *dev_priv,
 	       struct file *user_filp,
@@ -1156,6 +1227,115 @@ static int i915_oa_event_event_idx(struct perf_event *event)
 	return 0;
 }
 
+static int i915_gen_event_init(struct perf_event *event)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	int ret = 0;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* To avoid the complexity of having to accurately filter
+	 * data and marshal to the appropriate client
+	 * we currently only allow exclusive access */
+	if (dev_priv->gen_pmu.buffer.obj)
+		return -EBUSY;
+
+	/*
+	 * We need to check for CAP_SYS_ADMIN capability as we profile all
+	 * the running contexts
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+	mutex_lock(&dev_priv->dev->struct_mutex);
+
+	ret = init_gen_pmu_buffer(event);
+	if (ret) {
+		mutex_unlock(&dev_priv->dev->struct_mutex);
+		return ret;
+	}
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+
+	BUG_ON(dev_priv->gen_pmu.exclusive_event);
+	dev_priv->gen_pmu.exclusive_event = event;
+
+	event->destroy = i915_gen_event_destroy;
+
+	return 0;
+}
+
+static void i915_gen_event_start(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	unsigned long lock_flags;
+
+	spin_lock_irqsave(&dev_priv->gen_pmu.lock, lock_flags);
+
+	dev_priv->gen_pmu.event_active = true;
+
+	spin_unlock_irqrestore(&dev_priv->gen_pmu.lock, lock_flags);
+
+	__hrtimer_start_range_ns(&dev_priv->gen_pmu.timer, ns_to_ktime(PERIOD),
+					0, HRTIMER_MODE_REL_PINNED, 0);
+
+	event->hw.state = 0;
+}
+
+static void i915_gen_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	unsigned long lock_flags;
+
+	spin_lock_irqsave(&dev_priv->gen_pmu.lock, lock_flags);
+	dev_priv->gen_pmu.event_active = false;
+
+	spin_unlock_irqrestore(&dev_priv->gen_pmu.lock, lock_flags);
+
+	hrtimer_cancel(&dev_priv->gen_pmu.timer);
+	gen_pmu_flush_snapshots(dev_priv);
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_gen_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_gen_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_gen_event_del(struct perf_event *event, int flags)
+{
+	i915_gen_event_stop(event, flags);
+}
+
+static void i915_gen_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static void i915_gen_event_flush(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+
+	gen_pmu_flush_snapshots(i915);
+}
+
+static int i915_gen_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
 void i915_oa_context_pin_notify(struct drm_i915_private *dev_priv,
 				struct intel_context *context)
 {
@@ -1283,3 +1463,48 @@ void i915_oa_pmu_unregister(struct drm_device *dev)
 	perf_pmu_unregister(&i915->oa_pmu.pmu);
 	i915->oa_pmu.pmu.event_init = NULL;
 }
+
+void i915_gen_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (!(IS_HASWELL(dev) || IS_VALLEYVIEW(dev) || IS_BROADWELL(dev)))
+		return;
+
+	i915->gen_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->gen_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->gen_pmu.timer.function = hrtimer_sample_gen;
+
+	spin_lock_init(&i915->gen_pmu.lock);
+
+	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
+
+	/* Effectively disallow opening an event with a specific pid
+	 * since we aren't interested in processes running on the cpu...
+	 */
+	i915->gen_pmu.pmu.task_ctx_nr   = perf_invalid_context;
+
+	i915->gen_pmu.pmu.event_init    = i915_gen_event_init;
+	i915->gen_pmu.pmu.add	       = i915_gen_event_add;
+	i915->gen_pmu.pmu.del	       = i915_gen_event_del;
+	i915->gen_pmu.pmu.start	       = i915_gen_event_start;
+	i915->gen_pmu.pmu.stop	       = i915_gen_event_stop;
+	i915->gen_pmu.pmu.read	       = i915_gen_event_read;
+	i915->gen_pmu.pmu.flush	       = i915_gen_event_flush;
+	i915->gen_pmu.pmu.event_idx     = i915_gen_event_event_idx;
+
+	if (perf_pmu_register(&i915->gen_pmu.pmu, "i915_gen", -1))
+		i915->gen_pmu.pmu.event_init = NULL;
+}
+
+void i915_gen_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->gen_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->gen_pmu.pmu);
+	i915->gen_pmu.pmu.event_init = NULL;
+}
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 3/7] drm/i915: Introduce timestamp node for timestamp data collection
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
  2015-06-22  9:55 ` [RFC 1/7] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
  2015-06-22  9:55 ` [RFC 2/7] drm/i915: Register routines for Gen perf PMU driver sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22  9:55 ` [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface sourab.gupta
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch introduces the data structures for holding the timestamp data,
which can then be forwarded to userspace using the Gen perf PMU.
Each timestamp node holds the timestamp value, along with additional metadata
such as ctx_id, pid and ring.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 22 ++++++++++++++++++++++
 include/uapi/drm/i915_drm.h     | 18 ++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3491584..b6a897a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1701,6 +1701,28 @@ struct drm_i915_oa_async_node {
 	struct drm_i915_oa_async_node_info node_info;
 	__u32 report_perf[64]; /* Must be aligned to 64-byte boundary */
 };
+
+struct drm_i915_ts_queue_header {
+	__u64 size_in_bytes;
+	/* Byte offset, start of queue header to first node */
+	__u64 data_offset;
+	__u32 node_count;
+	__u32 wrap_count;
+};
+
+struct drm_i915_ts_node_info {
+	__u32 ring;
+	__u32 pid;
+	__u32 ctx_id;
+	__u32 perftag;
+	struct drm_i915_gem_request *req;
+};
+
+struct drm_i915_ts_node {
+	/* ensure timestamp starts on a qword boundary */
+	struct drm_i915_ts_data timestamp;
+	struct drm_i915_ts_node_info node_info;
+};
 #endif
 
 struct drm_i915_private {
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 4d99992..a7da421 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -131,6 +131,24 @@ struct drm_i915_oa_async_node_footer {
 	__u32 pad;
 };
 
+struct drm_i915_ts_node_footer {
+	__u32 ring;
+	__u32 pid;
+	__u32 ctx_id;
+	__u32 perftag;
+};
+
+struct drm_i915_ts_data {
+	__u32 ts_low;
+	__u32 ts_high;
+};
+
+struct drm_i915_ts_usernode {
+	/* ensure timestamp starts on a qword boundary */
+	struct drm_i915_ts_data timestamp;
+	struct drm_i915_ts_node_footer node_info;
+};
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255	/* table size 2k - maximum due to use
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (2 preceding siblings ...)
  2015-06-22  9:55 ` [RFC 3/7] drm/i915: Introduce timestamp node for timestamp data collection sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22 13:21   ` Chris Wilson
  2015-06-22  9:55 ` [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU sourab.gupta
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds the mechanism for forwarding the data snapshots through the
Gen PMU perf event interface.
In this particular case, the timestamp data nodes introduced earlier are
forwarded through the interface.

The samples are forwarded from a workqueue, which is scheduled when the
hrtimer fires. In the workqueue, each collected data node is forwarded as a
separate perf sample.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |   1 +
 drivers/gpu/drm/i915/i915_oa_perf.c | 125 +++++++++++++++++++++++++++++++++++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b6a897a..25c0938 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2021,6 +2021,7 @@ struct drm_i915_private {
 			u32 head;
 			u32 tail;
 		} buffer;
+		struct work_struct work_timer;
 	} gen_pmu;
 
 	struct list_head profile_cmd;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e2042b6..e3e867f 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -224,11 +224,121 @@ void forward_oa_async_snapshots_work(struct work_struct *__work)
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
+static void init_gen_pmu_buf_queue(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	void *data_ptr;
+
+	hdr->size_in_bytes = dev_priv->gen_pmu.buffer.obj->base.size;
+	/* 8 byte alignment for node address */
+	data_ptr = PTR_ALIGN((void *)(hdr + 1), 8);
+	hdr->data_offset = (__u64)(data_ptr - (void *)hdr);
+
+	hdr->node_count = 0;
+	hdr->wrap_count = 0;
+}
+
+static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
+				struct drm_i915_ts_node *node)
+{
+	struct perf_sample_data data;
+	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	int snapshot_size = sizeof(struct drm_i915_ts_usernode);
+	struct perf_raw_record raw;
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	/* Note: the combined u32 raw->size member + raw data itself must be 8
+	 * byte aligned.*/
+	raw.size = snapshot_size + 4;
+	raw.data = node;
+
+	data.raw = &raw;
+
+	perf_event_overflow(event, &data, &dev_priv->gen_pmu.dummy_regs);
+}
+
+void i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
+{
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	struct drm_i915_ts_node *first_node, *node;
+	int head, tail, num_nodes, ret;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_ts_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+	tail = hdr->node_count;
+	head = dev_priv->gen_pmu.buffer.head;
+
+	/* wait for all requests to complete*/
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req) {
+			if (!i915_gem_request_completed(req, true)) {
+				ret = i915_wait_request(req);
+				if (ret)
+					DRM_DEBUG_DRIVER(
+					"gen pmu: failed to wait\n");
+			}
+			i915_gem_request_assign(&node->node_info.req, NULL);
+		}
+		head++;
+	}
+}
+
+void forward_gen_pmu_snapshots_work(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			     gen_pmu.work_timer);
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	struct drm_i915_ts_node *first_node, *node;
+	int head, tail, num_nodes, ret;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_ts_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	tail = hdr->node_count;
+	head = dev_priv->gen_pmu.buffer.head;
+
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && i915_gem_request_completed(req, true)) {
+			forward_one_gen_pmu_sample(dev_priv, node);
+			i915_gem_request_assign(&node->node_info.req, NULL);
+			head++;
+		} else
+			break;
+	}
+
+	dev_priv->gen_pmu.buffer.tail = tail;
+	dev_priv->gen_pmu.buffer.head = head;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+}
+
 static void gen_pmu_flush_snapshots(struct drm_i915_private *dev_priv)
 {
 	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
-
-	/* TODO: routine for forwarding snapshots to userspace */
+	schedule_work(&dev_priv->gen_pmu.work_timer);
 }
 
 static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv,
@@ -652,6 +762,7 @@ static int init_gen_pmu_buffer(struct perf_event *event)
 	dev_priv->gen_pmu.buffer.obj = bo;
 
 	dev_priv->gen_pmu.buffer.addr = vmap_oa_buffer(bo);
+	init_gen_pmu_buf_queue(dev_priv);
 
 	DRM_DEBUG_DRIVER("Gen PMU Buffer initialized, vaddr = %p",
 			 dev_priv->gen_pmu.buffer.addr);
@@ -1327,6 +1438,13 @@ static void i915_gen_event_flush(struct perf_event *event)
 {
 	struct drm_i915_private *i915 =
 		container_of(event->pmu, typeof(*i915), gen_pmu.pmu);
+	int ret;
+
+	ret = i915_mutex_lock_interruptible(i915->dev);
+	if (ret)
+		return;
+	i915_gen_pmu_wait_gpu(i915);
+	mutex_unlock(&i915->dev->struct_mutex);
 
 	gen_pmu_flush_snapshots(i915);
 }
@@ -1476,6 +1594,7 @@ void i915_gen_pmu_register(struct drm_device *dev)
 	hrtimer_init(&i915->gen_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
+	INIT_WORK(&i915->gen_pmu.work_timer, forward_gen_pmu_snapshots_work);
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -1505,6 +1624,8 @@ void i915_gen_pmu_unregister(struct drm_device *dev)
 	if (i915->gen_pmu.pmu.event_init == NULL)
 		return;
 
+	cancel_work_sync(&i915->gen_pmu.work_timer);
+
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;
 }
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (3 preceding siblings ...)
  2015-06-22  9:55 ` [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22 13:22   ` Chris Wilson
  2015-06-22  9:55 ` [RFC 6/7] drm/i915: Add routines for inserting commands in the ringbuf for capturing timestamps sourab.gupta
  2015-06-22  9:55 ` [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU sourab.gupta
  6 siblings, 1 reply; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

To collect timestamps around any GPU workload, we need to insert commands to
capture them into the ringbuffer. Therefore, during the event stop call, we
need to wait for the GPU to finish processing the last request for which
these commands were inserted.
We need to ensure this processing is done before the event_destroy callback,
which deallocates the buffer holding the data.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 54 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 25c0938..a0e1d17 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2022,6 +2022,8 @@ struct drm_i915_private {
 			u32 tail;
 		} buffer;
 		struct work_struct work_timer;
+		struct work_struct work_event_stop;
+		struct completion complete;
 	} gen_pmu;
 
 	struct list_head profile_cmd;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e3e867f..574b6d3 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -306,6 +306,9 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
+	if (dev_priv->gen_pmu.event_active == false)
+		return;
+
 	first_node = (struct drm_i915_ts_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
@@ -335,6 +338,50 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
+void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			gen_pmu.work_event_stop);
+	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	struct drm_i915_ts_node *first_node, *node;
+	int head, tail, num_nodes, ret;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_ts_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	i915_gen_pmu_wait_gpu(dev_priv);
+
+	/* Ensure that all requests are completed*/
+	tail = hdr->node_count;
+	head = dev_priv->gen_pmu.buffer.head;
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && !i915_gem_request_completed(req, true))
+			WARN_ON(1);
+		head++;
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+	dev_priv->gen_pmu.buffer.tail = 0;
+	dev_priv->gen_pmu.buffer.head = 0;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	complete(&dev_priv->gen_pmu.complete);
+}
+
 static void gen_pmu_flush_snapshots(struct drm_i915_private *dev_priv)
 {
 	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
@@ -562,6 +609,7 @@ static void i915_oa_event_destroy(struct perf_event *event)
 
 static void gen_buffer_destroy(struct drm_i915_private *i915)
 {
+	wait_for_completion(&i915->gen_pmu.complete);
 	mutex_lock(&i915->dev->struct_mutex);
 
 	vunmap(i915->gen_pmu.buffer.addr);
@@ -1409,7 +1457,7 @@ static void i915_gen_event_stop(struct perf_event *event, int flags)
 	hrtimer_cancel(&dev_priv->gen_pmu.timer);
 	gen_pmu_flush_snapshots(dev_priv);
 
-	event->hw.state = PERF_HES_STOPPED;
+	schedule_work(&dev_priv->gen_pmu.work_event_stop);
 }
 
 static int i915_gen_event_add(struct perf_event *event, int flags)
@@ -1595,6 +1643,9 @@ void i915_gen_pmu_register(struct drm_device *dev)
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
 	INIT_WORK(&i915->gen_pmu.work_timer, forward_gen_pmu_snapshots_work);
+	INIT_WORK(&i915->gen_pmu.work_event_stop, i915_gen_pmu_stop_work_fn);
+	init_completion(&i915->gen_pmu.complete);
+
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -1625,6 +1676,7 @@ void i915_gen_pmu_unregister(struct drm_device *dev)
 		return;
 
 	cancel_work_sync(&i915->gen_pmu.work_timer);
+	cancel_work_sync(&i915->gen_pmu.work_event_stop);
 
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 6/7] drm/i915: Add routines for inserting commands in the ringbuf for capturing timestamps
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (4 preceding siblings ...)
  2015-06-22  9:55 ` [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22  9:55 ` [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU sourab.gupta
  6 siblings, 0 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds the routines through which commands for capturing timestamps
can be inserted into the ringbuffer. These routines can be called at
appropriate places during workload execution.
The snapshots thus captured for each batchbuffer are then forwarded to
userspace using the perf event framework, through the Gen PMU interface.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_oa_perf.c | 88 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |  2 +
 2 files changed, 90 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index 574b6d3..ed0bdc9 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -99,6 +99,79 @@ void i915_oa_insert_cmd(struct intel_ringbuffer *ringbuf, u32 ctx_id,
 		queue_hdr->wrap_count++;
 }
 
+/* Returns the ring's ID mask (i.e. I915_EXEC_<ring>) */
+#define ring_id_mask(ring) ((ring)->id + 1)
+
+void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
+				int perftag)
+{
+	struct intel_engine_cs *ring = ringbuf->ring;
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_ts_node_info *node_info = NULL;
+	struct drm_i915_ts_queue_header *queue_hdr =
+			(struct drm_i915_ts_queue_header *)
+			dev_priv->gen_pmu.buffer.addr;
+	void *data_ptr = (u8 *)queue_hdr + queue_hdr->data_offset;
+	int data_size =	(queue_hdr->size_in_bytes - queue_hdr->data_offset);
+	u32 node_offset, timestamp_offset, addr = 0;
+	int ret;
+
+	struct drm_i915_ts_node *nodes = data_ptr;
+	int num_nodes = 0;
+	int index = 0;
+
+	num_nodes = data_size / sizeof(*nodes);
+	index = queue_hdr->node_count % num_nodes;
+
+	timestamp_offset = offsetof(struct drm_i915_ts_data, ts_low);
+
+	node_offset = i915_gem_obj_ggtt_offset(dev_priv->gen_pmu.buffer.obj) +
+			queue_hdr->data_offset +
+			index * sizeof(struct drm_i915_ts_node);
+	addr = node_offset +
+		offsetof(struct drm_i915_ts_node, timestamp) +
+		timestamp_offset;
+
+	if (ring->id == RCS) {
+		ret = intel_ring_begin(ring, 6);
+		if (ret)
+			return;
+
+		intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(5));
+		intel_ring_emit(ring,
+				PIPE_CONTROL_GLOBAL_GTT_IVB |
+				PIPE_CONTROL_TIMESTAMP_WRITE);
+		intel_ring_emit(ring, addr | PIPE_CONTROL_GLOBAL_GTT);
+		intel_ring_emit(ring, 0); /* imm low, must be zero */
+		intel_ring_emit(ring, 0); /* imm high, must be zero */
+		intel_ring_emit(ring, MI_NOOP);
+		intel_ring_advance(ring);
+	} else {
+		ret = intel_ring_begin(ring, 4);
+		if (ret)
+			return;
+
+		intel_ring_emit(ring,
+				MI_FLUSH_DW | MI_FLUSH_DW_OP_STAMP);
+		intel_ring_emit(ring, addr | MI_FLUSH_DW_USE_GTT);
+		intel_ring_emit(ring, 0); /* imm low, must be zero */
+		intel_ring_emit(ring, 0); /* imm high, must be zero */
+		intel_ring_advance(ring);
+	}
+	node_info = &nodes[index].node_info;
+	i915_gem_request_assign(&node_info->req,
+				ring->outstanding_lazy_request);
+
+	node_info = &nodes[index].node_info;
+	node_info->pid = current->pid;
+	node_info->ctx_id = ctx_id;
+	node_info->ring = ring_id_mask(ring);
+	node_info->perftag = perftag;
+	queue_hdr->node_count++;
+	if (queue_hdr->node_count > num_nodes)
+		queue_hdr->wrap_count++;
+}
+
 static void init_oa_async_buf_queue(struct drm_i915_private *dev_priv)
 {
 	struct drm_i915_oa_async_queue_header *hdr =
@@ -344,6 +417,7 @@ void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
 		container_of(__work, typeof(*dev_priv),
 			gen_pmu.work_event_stop);
 	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	struct drm_i915_insert_cmd *entry, *next;
 	struct drm_i915_ts_queue_header *hdr =
 		(struct drm_i915_ts_queue_header *)
 		dev_priv->gen_pmu.buffer.addr;
@@ -361,6 +435,13 @@ void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
 	if (ret)
 		return;
 
+	list_for_each_entry_safe(entry, next, &dev_priv->profile_cmd, list) {
+		if (entry->insert_cmd == i915_gen_insert_cmd_ts) {
+			list_del(&entry->list);
+			kfree(entry);
+		}
+	}
+
 	i915_gen_pmu_wait_gpu(dev_priv);
 
 	/* Ensure that all requests are completed*/
@@ -1430,10 +1511,17 @@ static void i915_gen_event_start(struct perf_event *event, int flags)
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
 	unsigned long lock_flags;
+	struct drm_i915_insert_cmd *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+	if (!entry)
+		return;
+	entry->insert_cmd = i915_gen_insert_cmd_ts;
 
 	spin_lock_irqsave(&dev_priv->gen_pmu.lock, lock_flags);
 
 	dev_priv->gen_pmu.event_active = true;
+	list_add_tail(&entry->list, &dev_priv->profile_cmd);
 
 	spin_unlock_irqrestore(&dev_priv->gen_pmu.lock, lock_flags);
 
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index c9955968..22eee10 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -347,6 +347,7 @@
 #define   MI_FLUSH_DW_STORE_INDEX	(1<<21)
 #define   MI_INVALIDATE_TLB		(1<<18)
 #define   MI_FLUSH_DW_OP_STOREDW	(1<<14)
+#define   MI_FLUSH_DW_OP_STAMP		(3<<14)
 #define   MI_FLUSH_DW_OP_MASK		(3<<14)
 #define   MI_FLUSH_DW_NOTIFY		(1<<8)
 #define   MI_INVALIDATE_BSD		(1<<7)
@@ -422,6 +423,7 @@
 #define   PIPE_CONTROL_TLB_INVALIDATE			(1<<18)
 #define   PIPE_CONTROL_MEDIA_STATE_CLEAR		(1<<16)
 #define   PIPE_CONTROL_QW_WRITE				(1<<14)
+#define   PIPE_CONTROL_TIMESTAMP_WRITE			(3<<14)
 #define   PIPE_CONTROL_POST_SYNC_OP_MASK                (3<<14)
 #define   PIPE_CONTROL_DEPTH_STALL			(1<<13)
 #define   PIPE_CONTROL_WRITE_FLUSH			(1<<12)
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU
  2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
                   ` (5 preceding siblings ...)
  2015-06-22  9:55 ` [RFC 6/7] drm/i915: Add routines for inserting commands in the ringbuf for capturing timestamps sourab.gupta
@ 2015-06-22  9:55 ` sourab.gupta
  2015-06-22 13:29   ` Chris Wilson
  2015-06-22 16:06   ` Daniel Vetter
  6 siblings, 2 replies; 23+ messages in thread
From: sourab.gupta @ 2015-06-22  9:55 UTC (permalink / raw)
  To: intel-gfx; +Cc: Insoo Woo, Peter Zijlstra, Jabin Wu, Sourab Gupta

From: Sourab Gupta <sourab.gupta@intel.com>

This patch adds support for retrieving MMIO register values through the Gen
perf PMU interface. Through this interface, userspace can now request up to
8 MMIO register values to be dumped, along with the timestamp values which
were already dumped across batchbuffer boundaries.
Userspace passes the addresses of up to 8 MMIO registers through the perf
attr config. The commands to dump the values of these MMIO registers are
then inserted into the ring along with the commands to dump the timestamps.
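
For illustration, userspace could request the register dumps roughly as below
(a sketch only; the offsets shown are placeholders rather than a recommended
set, and a zero entry terminates the list, matching the kernel side):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <drm/i915_drm.h>	/* needs the uapi additions from this series */

static int open_i915_gen_event_with_mmio(int pmu_type)
{
	struct drm_i915_gen_pmu_attr gen_attr = {
		.size = I915_GEN_PMU_ATTR_SIZE_VER0,
		.mmio_list = { 0x2358, 0x12358 },	/* placeholder offsets */
	};
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = pmu_type;		/* "i915_gen" type id from sysfs */
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;
	/* the kernel copies the structure from this user pointer */
	attr.config = (uint64_t)(uintptr_t)&gen_attr;

	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}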

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |   4 +-
 drivers/gpu/drm/i915/i915_oa_perf.c | 119 ++++++++++++++++++++++++++++++++----
 include/uapi/drm/i915_drm.h         |   9 ++-
 3 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a0e1d17..1f86358 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1718,9 +1718,10 @@ struct drm_i915_ts_node_info {
 	struct drm_i915_gem_request *req;
 };
 
-struct drm_i915_ts_node {
+struct drm_i915_ts_mmio_node {
 	/* ensure timestamp starts on a qword boundary */
 	struct drm_i915_ts_data timestamp;
+	__u32 mmio[8];
 	struct drm_i915_ts_node_info node_info;
 };
 #endif
@@ -2024,6 +2025,7 @@ struct drm_i915_private {
 		struct work_struct work_timer;
 		struct work_struct work_event_stop;
 		struct completion complete;
+		u32 mmio_list[8];
 	} gen_pmu;
 
 	struct list_head profile_cmd;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index ed0bdc9..465e823 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -113,10 +113,10 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
 			dev_priv->gen_pmu.buffer.addr;
 	void *data_ptr = (u8 *)queue_hdr + queue_hdr->data_offset;
 	int data_size =	(queue_hdr->size_in_bytes - queue_hdr->data_offset);
-	u32 node_offset, timestamp_offset, addr = 0;
-	int ret;
+	u32 node_offset, timestamp_offset, mmio_offset, addr = 0;
+	int ret, i = 0;
 
-	struct drm_i915_ts_node *nodes = data_ptr;
+	struct drm_i915_ts_mmio_node *nodes = data_ptr;
 	int num_nodes = 0;
 	int index = 0;
 
@@ -124,12 +124,14 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
 	index = queue_hdr->node_count % num_nodes;
 
 	timestamp_offset = offsetof(struct drm_i915_ts_data, ts_low);
+	mmio_offset =
+		offsetof(struct drm_i915_ts_mmio_node, mmio);
 
 	node_offset = i915_gem_obj_ggtt_offset(dev_priv->gen_pmu.buffer.obj) +
 			queue_hdr->data_offset +
-			index * sizeof(struct drm_i915_ts_node);
+			index * sizeof(struct drm_i915_ts_mmio_node);
 	addr = node_offset +
-		offsetof(struct drm_i915_ts_node, timestamp) +
+		offsetof(struct drm_i915_ts_mmio_node, timestamp) +
 		timestamp_offset;
 
 	if (ring->id == RCS) {
@@ -158,6 +160,27 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
 		intel_ring_emit(ring, 0); /* imm high, must be zero */
 		intel_ring_advance(ring);
 	}
+
+	for (i = 0; i < 8; i++) {
+		if (0 == dev_priv->gen_pmu.mmio_list[i])
+			break;
+
+		addr = node_offset + mmio_offset +
+			i * sizeof(dev_priv->gen_pmu.mmio_list[i]);
+
+		ret = intel_ring_begin(ring, 4);
+		if (ret)
+			return;
+
+		intel_ring_emit(ring,
+				MI_STORE_REGISTER_MEM(1) |
+				MI_SRM_LRM_GLOBAL_GTT);
+		intel_ring_emit(ring, dev_priv->gen_pmu.mmio_list[i]);
+		intel_ring_emit(ring, addr);
+		intel_ring_emit(ring, MI_NOOP);
+		intel_ring_advance(ring);
+	}
+
 	node_info = &nodes[index].node_info;
 	i915_gem_request_assign(&node_info->req,
 				ring->outstanding_lazy_request);
@@ -314,11 +337,11 @@ static void init_gen_pmu_buf_queue(struct drm_i915_private *dev_priv)
 }
 
 static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
-				struct drm_i915_ts_node *node)
+				struct drm_i915_ts_mmio_node *node)
 {
 	struct perf_sample_data data;
 	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
-	int snapshot_size = sizeof(struct drm_i915_ts_usernode);
+	int snapshot_size = sizeof(struct drm_i915_ts_mmio_usernode);
 	struct perf_raw_record raw;
 
 	perf_sample_data_init(&data, 0, event->hw.last_period);
@@ -338,11 +361,11 @@ void i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
 	struct drm_i915_ts_queue_header *hdr =
 		(struct drm_i915_ts_queue_header *)
 		dev_priv->gen_pmu.buffer.addr;
-	struct drm_i915_ts_node *first_node, *node;
+	struct drm_i915_ts_mmio_node *first_node, *node;
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
-	first_node = (struct drm_i915_ts_node *)
+	first_node = (struct drm_i915_ts_mmio_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
 			sizeof(*node);
@@ -375,14 +398,14 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	struct drm_i915_ts_queue_header *hdr =
 		(struct drm_i915_ts_queue_header *)
 		dev_priv->gen_pmu.buffer.addr;
-	struct drm_i915_ts_node *first_node, *node;
+	struct drm_i915_ts_mmio_node *first_node, *node;
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
 	if (dev_priv->gen_pmu.event_active == false)
 		return;
 
-	first_node = (struct drm_i915_ts_node *)
+	first_node = (struct drm_i915_ts_mmio_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
 			sizeof(*node);
@@ -421,11 +444,11 @@ void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
 	struct drm_i915_ts_queue_header *hdr =
 		(struct drm_i915_ts_queue_header *)
 		dev_priv->gen_pmu.buffer.addr;
-	struct drm_i915_ts_node *first_node, *node;
+	struct drm_i915_ts_mmio_node *first_node, *node;
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
-	first_node = (struct drm_i915_ts_node *)
+	first_node = (struct drm_i915_ts_mmio_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
 			sizeof(*node);
@@ -1467,15 +1490,85 @@ static int i915_oa_event_event_idx(struct perf_event *event)
 	return 0;
 }
 
+static int i915_gen_pmu_copy_attr(struct drm_i915_gen_pmu_attr __user *uattr,
+			     struct drm_i915_gen_pmu_attr *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, I915_GEN_PMU_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (size < I915_GEN_PMU_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we dont know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
 static int i915_gen_event_init(struct perf_event *event)
 {
 	struct drm_i915_private *dev_priv =
 		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
+	struct drm_i915_gen_pmu_attr gen_attr;
 	int ret = 0;
 
 	if (event->attr.type != event->pmu->type)
 		return -ENOENT;
 
+	ret = i915_gen_pmu_copy_attr(to_user_ptr(event->attr.config),
+				&gen_attr);
+	if (ret)
+		return ret;
+
+	memcpy(dev_priv->gen_pmu.mmio_list, gen_attr.mmio_list,
+			sizeof(dev_priv->gen_pmu.mmio_list));
+
 	/* To avoid the complexity of having to accurately filter
 	 * data and marshal to the appropriate client
 	 * we currently only allow exclusive access */
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a7da421..8d4deec 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -80,6 +80,7 @@
 #define I915_OA_METRICS_SET_MAX			I915_OA_METRICS_SET_SAMPLER_BALANCE
 
 #define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
+#define I915_GEN_PMU_ATTR_SIZE_VER0	36  /* sizeof first published struct */
 
 typedef struct _drm_i915_oa_attr {
 	__u32 size;
@@ -97,6 +98,11 @@ typedef struct _drm_i915_oa_attr {
 	__reserved_2:31;
 } drm_i915_oa_attr_t;
 
+struct drm_i915_gen_pmu_attr {
+	__u32 size;
+	__u32 mmio_list[8];
+};
+
 /* Header for PERF_RECORD_DEVICE type events */
 typedef struct _drm_i915_oa_event_header {
 	__u32 type;
@@ -143,9 +149,10 @@ struct drm_i915_ts_data {
 	__u32 ts_high;
 };
 
-struct drm_i915_ts_usernode {
+struct drm_i915_ts_mmio_usernode {
 	/* ensure timestamp starts on a qword boundary */
 	struct drm_i915_ts_data timestamp;
+	__u32 mmio[8];
 	struct drm_i915_ts_node_footer node_info;
 };
 
-- 
1.8.5.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface
  2015-06-22  9:55 ` [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface sourab.gupta
@ 2015-06-22 13:21   ` Chris Wilson
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2015-06-22 13:21 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:25:06PM +0530, sourab.gupta@intel.com wrote:
> +void i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_i915_ts_queue_header *hdr =
> +		(struct drm_i915_ts_queue_header *)
> +		dev_priv->gen_pmu.buffer.addr;
> +	struct drm_i915_ts_node *first_node, *node;
> +	int head, tail, num_nodes, ret;
> +	struct drm_i915_gem_request *req;
> +
> +	first_node = (struct drm_i915_ts_node *)
> +			((char *)hdr + hdr->data_offset);
> +	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
> +			sizeof(*node);
> +
> +	tail = hdr->node_count;
> +	head = dev_priv->gen_pmu.buffer.head;
> +
> +	/* wait for all requests to complete*/
> +	while ((head % num_nodes) != (tail % num_nodes)) {
> +		node = &first_node[head % num_nodes];
> +		req = node->node_info.req;
> +		if (req) {
> +			if (!i915_gem_request_completed(req, true)) {
> +				ret = i915_wait_request(req);
> +				if (ret)
> +					DRM_DEBUG_DRIVER(
> +					"gen pmu: failed to wait\n");
> +			}
> +			i915_gem_request_assign(&node->node_info.req, NULL);
> +		}
> +		head++;
> +	}

You can, sorry must, rewrite this to avoid the struct mutex, and handle
the *likely* errors correctly, and even remove the superfluous
i915_gem_request_completed.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-22  9:55 ` [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU sourab.gupta
@ 2015-06-22 13:22   ` Chris Wilson
  2015-06-22 16:09     ` Daniel Vetter
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Wilson @ 2015-06-22 13:22 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> To collect timestamps around any GPU workload, we need to insert
> commands to capture them into the ringbuffer. Therefore, during the stop event
> call, we need to wait for GPU to complete processing the last request for
> which these commands were inserted.
> We need to ensure this processing is done before event_destroy callback which
> deallocates the buffer for holding the data.

There's no reason for this to be synchronous. Just that you need an
active reference on the output buffer to be only released after the
final request.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU
  2015-06-22  9:55 ` [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU sourab.gupta
@ 2015-06-22 13:29   ` Chris Wilson
  2015-06-22 16:06   ` Daniel Vetter
  1 sibling, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2015-06-22 13:29 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:25:09PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch adds support for retrieving MMIO register values through Gen Perf PMU
> interface. Through this interface, now the userspace can request upto 8 MMIO
> register values to be dumped, alongwith the timestamp values which were dumped
> earlier across the batchbuffer boundaries.
> Userspace can pass the addresses of upto 8 MMIO registers through perf attr
> config. The commands to dump the values of these MMIO registers are then
> inserted into the ring alongwith commands to dump the timestamps.

I don't see any whitelisting of allowed register targets...
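
Something along these lines in the attr copy path would be the bare minimum
(a sketch only; the offsets below are placeholders, not a vetted list):

static const u32 gen_pmu_mmio_whitelist[] = {
	0x2358,		/* placeholder offsets */
	0x12358,
};

static bool gen_pmu_mmio_allowed(u32 reg)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(gen_pmu_mmio_whitelist); i++)
		if (gen_pmu_mmio_whitelist[i] == reg)
			return true;
	return false;
}

with each non-zero mmio_list[] entry rejected (-EINVAL) when it is not on the
list.
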
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU
  2015-06-22  9:55 ` [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU sourab.gupta
  2015-06-22 13:29   ` Chris Wilson
@ 2015-06-22 16:06   ` Daniel Vetter
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2015-06-22 16:06 UTC (permalink / raw)
  To: sourab.gupta; +Cc: intel-gfx, Insoo Woo, Peter Zijlstra, Jabin Wu

On Mon, Jun 22, 2015 at 03:25:09PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> This patch adds support for retrieving MMIO register values through Gen Perf PMU
> interface. Through this interface, now the userspace can request upto 8 MMIO
> register values to be dumped, alongwith the timestamp values which were dumped
> earlier across the batchbuffer boundaries.
> Userspace can pass the addresses of upto 8 MMIO registers through perf attr
> config. The commands to dump the values of these MMIO registers are then
> inserted into the ring alongwith commands to dump the timestamps.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>

I'm not a fan of exposing random mmio's to userspace through this. The OA
counters are kinda special since we also need to allow capture mid-batch
for the gl perf extensions and hence interpreting the data must be done in
mesa (or we'll end up with duplicated code across kernel/userspace, which
sucks). But the mmio captures for other engines don't have this issue, so
an explicit list of useful things to capture would be good. Especially
since this would allow generic tools to get interesting samples for all
engines.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_drv.h     |   4 +-
>  drivers/gpu/drm/i915/i915_oa_perf.c | 119 ++++++++++++++++++++++++++++++++----
>  include/uapi/drm/i915_drm.h         |   9 ++-
>  3 files changed, 117 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a0e1d17..1f86358 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1718,9 +1718,10 @@ struct drm_i915_ts_node_info {
>  	struct drm_i915_gem_request *req;
>  };
>  
> -struct drm_i915_ts_node {
> +struct drm_i915_ts_mmio_node {
>  	/* ensure timestamp starts on a qword boundary */
>  	struct drm_i915_ts_data timestamp;
> +	__u32 mmio[8];
>  	struct drm_i915_ts_node_info node_info;
>  };
>  #endif
> @@ -2024,6 +2025,7 @@ struct drm_i915_private {
>  		struct work_struct work_timer;
>  		struct work_struct work_event_stop;
>  		struct completion complete;
> +		u32 mmio_list[8];
>  	} gen_pmu;
>  
>  	struct list_head profile_cmd;
> diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
> index ed0bdc9..465e823 100644
> --- a/drivers/gpu/drm/i915/i915_oa_perf.c
> +++ b/drivers/gpu/drm/i915/i915_oa_perf.c
> @@ -113,10 +113,10 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
>  			dev_priv->gen_pmu.buffer.addr;
>  	void *data_ptr = (u8 *)queue_hdr + queue_hdr->data_offset;
>  	int data_size =	(queue_hdr->size_in_bytes - queue_hdr->data_offset);
> -	u32 node_offset, timestamp_offset, addr = 0;
> -	int ret;
> +	u32 node_offset, timestamp_offset, mmio_offset, addr = 0;
> +	int ret, i = 0;
>  
> -	struct drm_i915_ts_node *nodes = data_ptr;
> +	struct drm_i915_ts_mmio_node *nodes = data_ptr;
>  	int num_nodes = 0;
>  	int index = 0;
>  
> @@ -124,12 +124,14 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
>  	index = queue_hdr->node_count % num_nodes;
>  
>  	timestamp_offset = offsetof(struct drm_i915_ts_data, ts_low);
> +	mmio_offset =
> +		offsetof(struct drm_i915_ts_mmio_node, mmio);
>  
>  	node_offset = i915_gem_obj_ggtt_offset(dev_priv->gen_pmu.buffer.obj) +
>  			queue_hdr->data_offset +
> -			index * sizeof(struct drm_i915_ts_node);
> +			index * sizeof(struct drm_i915_ts_mmio_node);
>  	addr = node_offset +
> -		offsetof(struct drm_i915_ts_node, timestamp) +
> +		offsetof(struct drm_i915_ts_mmio_node, timestamp) +
>  		timestamp_offset;
>  
>  	if (ring->id == RCS) {
> @@ -158,6 +160,27 @@ void i915_gen_insert_cmd_ts(struct intel_ringbuffer *ringbuf, u32 ctx_id,
>  		intel_ring_emit(ring, 0); /* imm high, must be zero */
>  		intel_ring_advance(ring);
>  	}
> +
> +	for (i = 0; i < 8; i++) {
> +		if (0 == dev_priv->gen_pmu.mmio_list[i])
> +			break;
> +
> +		addr = node_offset + mmio_offset +
> +			i * sizeof(dev_priv->gen_pmu.mmio_list[i]);
> +
> +		ret = intel_ring_begin(ring, 4);
> +		if (ret)
> +			return;
> +
> +		intel_ring_emit(ring,
> +				MI_STORE_REGISTER_MEM(1) |
> +				MI_SRM_LRM_GLOBAL_GTT);
> +		intel_ring_emit(ring, dev_priv->gen_pmu.mmio_list[i]);
> +		intel_ring_emit(ring, addr);
> +		intel_ring_emit(ring, MI_NOOP);
> +		intel_ring_advance(ring);
> +	}
> +
>  	node_info = &nodes[index].node_info;
>  	i915_gem_request_assign(&node_info->req,
>  				ring->outstanding_lazy_request);
> @@ -314,11 +337,11 @@ static void init_gen_pmu_buf_queue(struct drm_i915_private *dev_priv)
>  }
>  
>  static void forward_one_gen_pmu_sample(struct drm_i915_private *dev_priv,
> -				struct drm_i915_ts_node *node)
> +				struct drm_i915_ts_mmio_node *node)
>  {
>  	struct perf_sample_data data;
>  	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
> -	int snapshot_size = sizeof(struct drm_i915_ts_usernode);
> +	int snapshot_size = sizeof(struct drm_i915_ts_mmio_usernode);
>  	struct perf_raw_record raw;
>  
>  	perf_sample_data_init(&data, 0, event->hw.last_period);
> @@ -338,11 +361,11 @@ void i915_gen_pmu_wait_gpu(struct drm_i915_private *dev_priv)
>  	struct drm_i915_ts_queue_header *hdr =
>  		(struct drm_i915_ts_queue_header *)
>  		dev_priv->gen_pmu.buffer.addr;
> -	struct drm_i915_ts_node *first_node, *node;
> +	struct drm_i915_ts_mmio_node *first_node, *node;
>  	int head, tail, num_nodes, ret;
>  	struct drm_i915_gem_request *req;
>  
> -	first_node = (struct drm_i915_ts_node *)
> +	first_node = (struct drm_i915_ts_mmio_node *)
>  			((char *)hdr + hdr->data_offset);
>  	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
>  			sizeof(*node);
> @@ -375,14 +398,14 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
>  	struct drm_i915_ts_queue_header *hdr =
>  		(struct drm_i915_ts_queue_header *)
>  		dev_priv->gen_pmu.buffer.addr;
> -	struct drm_i915_ts_node *first_node, *node;
> +	struct drm_i915_ts_mmio_node *first_node, *node;
>  	int head, tail, num_nodes, ret;
>  	struct drm_i915_gem_request *req;
>  
>  	if (dev_priv->gen_pmu.event_active == false)
>  		return;
>  
> -	first_node = (struct drm_i915_ts_node *)
> +	first_node = (struct drm_i915_ts_mmio_node *)
>  			((char *)hdr + hdr->data_offset);
>  	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
>  			sizeof(*node);
> @@ -421,11 +444,11 @@ void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
>  	struct drm_i915_ts_queue_header *hdr =
>  		(struct drm_i915_ts_queue_header *)
>  		dev_priv->gen_pmu.buffer.addr;
> -	struct drm_i915_ts_node *first_node, *node;
> +	struct drm_i915_ts_mmio_node *first_node, *node;
>  	int head, tail, num_nodes, ret;
>  	struct drm_i915_gem_request *req;
>  
> -	first_node = (struct drm_i915_ts_node *)
> +	first_node = (struct drm_i915_ts_mmio_node *)
>  			((char *)hdr + hdr->data_offset);
>  	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
>  			sizeof(*node);
> @@ -1467,15 +1490,85 @@ static int i915_oa_event_event_idx(struct perf_event *event)
>  	return 0;
>  }
>  
> +static int i915_gen_pmu_copy_attr(struct drm_i915_gen_pmu_attr __user *uattr,
> +			     struct drm_i915_gen_pmu_attr *attr)
> +{
> +	u32 size;
> +	int ret;
> +
> +	if (!access_ok(VERIFY_WRITE, uattr, I915_GEN_PMU_ATTR_SIZE_VER0))
> +		return -EFAULT;
> +
> +	/*
> +	 * zero the full structure, so that a short copy will be nice.
> +	 */
> +	memset(attr, 0, sizeof(*attr));
> +
> +	ret = get_user(size, &uattr->size);
> +	if (ret)
> +		return ret;
> +
> +	if (size > PAGE_SIZE)	/* silly large */
> +		goto err_size;
> +
> +	if (size < I915_GEN_PMU_ATTR_SIZE_VER0)
> +		goto err_size;
> +
> +	/*
> +	 * If we're handed a bigger struct than we know of,
> +	 * ensure all the unknown bits are 0 - i.e. new
> +	 * user-space does not rely on any kernel feature
> +	 * extensions we dont know about yet.
> +	 */
> +	if (size > sizeof(*attr)) {
> +		unsigned char __user *addr;
> +		unsigned char __user *end;
> +		unsigned char val;
> +
> +		addr = (void __user *)uattr + sizeof(*attr);
> +		end  = (void __user *)uattr + size;
> +
> +		for (; addr < end; addr++) {
> +			ret = get_user(val, addr);
> +			if (ret)
> +				return ret;
> +			if (val)
> +				goto err_size;
> +		}
> +		size = sizeof(*attr);
> +	}
> +
> +	ret = copy_from_user(attr, uattr, size);
> +	if (ret)
> +		return -EFAULT;
> +
> +out:
> +	return ret;
> +
> +err_size:
> +	put_user(sizeof(*attr), &uattr->size);
> +	ret = -E2BIG;
> +	goto out;
> +}
> +
>  static int i915_gen_event_init(struct perf_event *event)
>  {
>  	struct drm_i915_private *dev_priv =
>  		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);
> +	struct drm_i915_gen_pmu_attr gen_attr;
>  	int ret = 0;
>  
>  	if (event->attr.type != event->pmu->type)
>  		return -ENOENT;
>  
> +	ret = i915_gen_pmu_copy_attr(to_user_ptr(event->attr.config),
> +				&gen_attr);
> +	if (ret)
> +		return ret;
> +
> +	memcpy(dev_priv->gen_pmu.mmio_list, gen_attr.mmio_list,
> +			sizeof(dev_priv->gen_pmu.mmio_list));
> +
>  	/* To avoid the complexity of having to accurately filter
>  	 * data and marshal to the appropriate client
>  	 * we currently only allow exclusive access */
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index a7da421..8d4deec 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -80,6 +80,7 @@
>  #define I915_OA_METRICS_SET_MAX			I915_OA_METRICS_SET_SAMPLER_BALANCE
>  
>  #define I915_OA_ATTR_SIZE_VER0		32  /* sizeof first published struct */
> +#define I915_GEN_PMU_ATTR_SIZE_VER0	36  /* sizeof first published struct */
>  
>  typedef struct _drm_i915_oa_attr {
>  	__u32 size;
> @@ -97,6 +98,11 @@ typedef struct _drm_i915_oa_attr {
>  	__reserved_2:31;
>  } drm_i915_oa_attr_t;
>  
> +struct drm_i915_gen_pmu_attr {
> +	__u32 size;
> +	__u32 mmio_list[8];
> +};
> +
>  /* Header for PERF_RECORD_DEVICE type events */
>  typedef struct _drm_i915_oa_event_header {
>  	__u32 type;
> @@ -143,9 +149,10 @@ struct drm_i915_ts_data {
>  	__u32 ts_high;
>  };
>  
> -struct drm_i915_ts_usernode {
> +struct drm_i915_ts_mmio_usernode {
>  	/* ensure timestamp starts on a qword boundary */
>  	struct drm_i915_ts_data timestamp;
> +	__u32 mmio[8];
>  	struct drm_i915_ts_node_footer node_info;
>  };
>  
> -- 
> 1.8.5.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-22 13:22   ` Chris Wilson
@ 2015-06-22 16:09     ` Daniel Vetter
  2015-06-25  6:02       ` Gupta, Sourab
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2015-06-22 16:09 UTC (permalink / raw)
  To: Chris Wilson, sourab.gupta, intel-gfx, Insoo Woo, Peter Zijlstra,
	Jabin Wu

On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > To collect timestamps around any GPU workload, we need to insert
> > commands to capture them into the ringbuffer. Therefore, during the stop event
> > call, we need to wait for GPU to complete processing the last request for
> > which these commands were inserted.
> > We need to ensure this processing is done before event_destroy callback which
> > deallocates the buffer for holding the data.
> 
> There's no reason for this to be synchronous. Just that you need an
> active reference on the output buffer to be only released after the
> final request.

Yeah I think the interaction between OA sampling and GEM is the critical
piece here for both patch series. Step one is to have a per-pmu lock to
keep track of data private to OA and mmio based sampling as a starting
point. Then we need to figure out how to structure the connection without
OA/PMU and gem trampling over each other's feet.

Wrt requests those are refcounted already and just waiting on them doesn't
need dev->struct_mutex. That should be all you need, assuming you do
correctly refcount them ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-22 16:09     ` Daniel Vetter
@ 2015-06-25  6:02       ` Gupta, Sourab
  2015-06-25  7:42         ` Daniel Vetter
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Gupta, Sourab @ 2015-06-25  6:02 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > 
> > > To collect timestamps around any GPU workload, we need to insert
> > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > call, we need to wait for GPU to complete processing the last request for
> > > which these commands were inserted.
> > > We need to ensure this processing is done before event_destroy callback which
> > > deallocates the buffer for holding the data.
> > 
> > There's no reason for this to be synchronous. Just that you need an
> > active reference on the output buffer to be only released after the
> > final request.
> 
> Yeah I think the interaction between OA sampling and GEM is the critical
> piece here for both patch series. Step one is to have a per-pmu lock to
> keep track of data private to OA and mmio based sampling as a starting
> point. Then we need to figure out how to structure the connection without
> OA/PMU and gem trampling over each other's feet.
> 
> Wrt requests those are refcounted already and just waiting on them doesn't
> need dev->struct_mutex. That should be all you need, assuming you do
> correctly refcount them ...
> -Daniel

Hi Daniel/Chris,

Thanks for your feedback. I acknowledge the issue with increasing
coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
of data private to OA in my next version of patches.

W.r.t. having the synchronous wait during event stop, I thought through
the method of having active reference on output buffer. This will let us
have a delayed buffer destroy (controlled by decrement of output buffer
refcount as and when gpu requests complete). This will be decoupled from
the event stop here. But that is not the only thing at play here.

The event stop is expected to disable the OA unit (which it is doing
right now). In my usecase, I can't disable OA unit, till the time GPU
processes the MI_RPC requests scheduled already, which otherwise may
lead to hangs.
So, I'm waiting synchronously for requests to complete before disabling
OA unit.

Also, the event_destroy callback should be destroying the buffers
associated with event. (Right now, in current design, event_destroy
callback waits for the above operations to be completed before
proceeding to destroy the buffer).

If I try to create a function which destroys the output buffer according
to all references being released, all these operations will have to be
performed together in that function, in this serial order (so that when
refcount drops to 0, i.e. requests have completed, OA unit will be
disabled, and then the buffer destroyed). This would lead to these
operations being decoupled from the event_stop and event_destroy
functions. This may be non-intentional behaviour w.r.t. these callbacks
from userspace perspective.

Robert, any thoughts?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  6:02       ` Gupta, Sourab
@ 2015-06-25  7:42         ` Daniel Vetter
  2015-06-25  8:27           ` Gupta, Sourab
  2015-06-25  8:02         ` Chris Wilson
  2015-06-25 13:02         ` Robert Bragg
  2 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2015-06-25  7:42 UTC (permalink / raw)
  To: Gupta, Sourab
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > > call, we need to wait for GPU to complete processing the last request for
> > > > which these commands were inserted.
> > > > We need to ensure this processing is done before event_destroy callback which
> > > > deallocates the buffer for holding the data.
> > > 
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> > 
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection without
> > OA/PMU and gem trampling over each other's feet.
> > 
> > Wrt requests those are refcounted already and just waiting on them doesn't
> > need dev->struct_mutex. That should be all you need, assuming you do
> > correctly refcount them ...
> > -Daniel
> 
> Hi Daniel/Chris,
> 
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.

If you create new locking schemes please coordinate with Rob about them.
Better if we come up with a consistent locking scheme for all of OA. That
ofc doesn't exclude that we'll have per-pmu locking, just that the locking
design should match if it makes sense. The example I'm thinking of is
drrs&psr which both use the frontbuffer tracking code and both have their
private mutex. But the overall approach to locking and cleaning up the
async work is the same.

> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
> 
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.
> 
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
> 
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.
> 
> Robert, any thoughts?

Yeah, no idea whether those perf callbacks are allowed to stall for a
longer time. If not, we could simply push the cleanup/OA disabling into an
async work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  6:02       ` Gupta, Sourab
  2015-06-25  7:42         ` Daniel Vetter
@ 2015-06-25  8:02         ` Chris Wilson
  2015-06-25 17:31           ` Robert Bragg
  2015-06-25 13:02         ` Robert Bragg
  2 siblings, 1 reply; 23+ messages in thread
From: Chris Wilson @ 2015-06-25  8:02 UTC (permalink / raw)
  To: Gupta, Sourab
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > > call, we need to wait for GPU to complete processing the last request for
> > > > which these commands were inserted.
> > > > We need to ensure this processing is done before event_destroy callback which
> > > > deallocates the buffer for holding the data.
> > > 
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> > 
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection without
> > OA/PMU and gem trampling over each other's feet.
> > 
> > Wrt requests those are refcounted already and just waiting on them doesn't
> > need dev->struct_mutex. That should be all you need, assuming you do
> > correctly refcount them ...
> > -Daniel
> 
> Hi Daniel/Chris,
> 
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.
> 
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
> 
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.

There's no issue here. Just hook into the retire-notification callback
for the last active request of the oa buffer. If you are using LRI to
disable the OA Context writing to the buffer, you can simply create a
request for the LRI and use the active reference as the retire
notification. (Otoh, if the oa buffer is inactive you can simply do the
decoupling immediately.) You shouldn't need an object-free-notification
callback aiui.
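
For illustration, a rough sketch of that suggestion (the register choice and
surrounding plumbing are placeholders, not code from this series): the LRI
that stops OA writes is emitted as part of a request, and the destination
buffer's vma is moved to the active list so it is only released once that
request retires:

static int gen_pmu_emit_oa_disable(struct intel_engine_cs *ring,
				   struct drm_i915_gem_object *dest_obj)
{
	int ret = intel_ring_begin(ring, 4);
	if (ret)
		return ret;

	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
	intel_ring_emit(ring, OACONTROL);	/* placeholder: whichever reg gates the writes */
	intel_ring_emit(ring, 0);		/* stop OA report writes */
	intel_ring_emit(ring, MI_NOOP);
	intel_ring_advance(ring);

	/* tie the dest buffer's lifetime to this request's active reference */
	i915_vma_move_to_active(i915_gem_obj_to_ggtt(dest_obj), ring);
	return 0;
}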
 
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
> 
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.

When the perf event is stopped, you stop generating perf events. But the
buffer may still be alive, and may be resurrected if a new event is
started, but you will want to ensure that pending oa writes
are ignored.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  7:42         ` Daniel Vetter
@ 2015-06-25  8:27           ` Gupta, Sourab
  2015-06-25 11:47             ` Robert Bragg
  0 siblings, 1 reply; 23+ messages in thread
From: Gupta, Sourab @ 2015-06-25  8:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > > 
> > > > > To collect timestamps around any GPU workload, we need to insert
> > > > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > > > call, we need to wait for GPU to complete processing the last request for
> > > > > which these commands were inserted.
> > > > > We need to ensure this processing is done before event_destroy callback which
> > > > > deallocates the buffer for holding the data.
> > > > 
> > > > There's no reason for this to be synchronous. Just that you need an
> > > > active reference on the output buffer to be only released after the
> > > > final request.
> > > 
> > > Yeah I think the interaction between OA sampling and GEM is the critical
> > > piece here for both patch series. Step one is to have a per-pmu lock to
> > > keep track of data private to OA and mmio based sampling as a starting
> > > point. Then we need to figure out how to structure the connection without
> > > OA/PMU and gem trampling over each other's feet.
> > > 
> > > Wrt requests those are refcounted already and just waiting on them doesn't
> > > need dev->struct_mutex. That should be all you need, assuming you do
> > > correctly refcount them ...
> > > -Daniel
> > 
> > Hi Daniel/Chris,
> > 
> > Thanks for your feedback. I acknowledge the issue with increasing
> > coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> > of data private to OA in my next version of patches.
> 
> If you create new locking schemes please coordinate with Rob about them.
> Better if we come up with a consistent locking scheme for all of OA. That
> ofc doesn't exclude that we'll have per-pmu locking, just that the locking
> design should match if it makes sense. The example I'm thinking of is
> drrs&psr which both use the frontbuffer tracking code and both have their
> private mutex. But the overall approach to locking and cleaning up the
> async work is the same.
> 
Ok, I'll coordinate with Robert about a consistent locking scheme here.

> > W.r.t. having the synchronous wait during event stop, I thought through
> > the method of having active reference on output buffer. This will let us
> > have a delayed buffer destroy (controlled by decrement of output buffer
> > refcount as and when gpu requests complete). This will be decoupled from
> > the event stop here. But that is not the only thing at play here.
> > 
> > The event stop is expected to disable the OA unit (which it is doing
> > right now). In my usecase, I can't disable OA unit, till the time GPU
> > processes the MI_RPC requests scheduled already, which otherwise may
> > lead to hangs.
> > So, I'm waiting synchronously for requests to complete before disabling
> > OA unit.
> > 
> > Also, the event_destroy callback should be destroying the buffers
> > associated with event. (Right now, in current design, event_destroy
> > callback waits for the above operations to be completed before
> > proceeding to destroy the buffer).
> > 
> > If I try to create a function which destroys the output buffer according
> > to all references being released, all these operations will have to be
> > performed together in that function, in this serial order (so that when
> > refcount drops to 0, i.e. requests have completed, OA unit will be
> > disabled, and then the buffer destroyed). This would lead to these
> > operations being decoupled from the event_stop and event_destroy
> > functions. This may be non-intentional behaviour w.r.t. these callbacks
> > from userspace perspective.
> > 
> > Robert, any thoughts?
> 
> Yeah, no idea whether those perf callbacks are allowed to stall for a
> longer time. If not, we could simply push the cleanup/OA disabling into an
> async work.
> -Daniel

Hi Daniel,
Actually right now, the stop_event callback is not stalled. Instead I'm
scheduling an async work fn to do the job. The event destroy is then
waiting for the work fn to finish processing, before proceeding on to
destroy the buffer.
I'm also thinking through Chris' suggestions here and will try to come
up with a solution based on them.

Thanks,
Sourab


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  8:27           ` Gupta, Sourab
@ 2015-06-25 11:47             ` Robert Bragg
  0 siblings, 0 replies; 23+ messages in thread
From: Robert Bragg @ 2015-06-25 11:47 UTC (permalink / raw)
  To: Gupta, Sourab
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Thu, Jun 25, 2015 at 9:27 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
>> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > > >
>> > > > > To collect timestamps around any GPU workload, we need to insert
>> > > > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > > > call, we need to wait for GPU to complete processing the last request for
>> > > > > which these commands were inserted.
>> > > > > We need to ensure this processing is done before event_destroy callback which
>> > > > > deallocates the buffer for holding the data.
>> > > >
>> > > > There's no reason for this to be synchronous. Just that you need an
>> > > > active reference on the output buffer to be only released after the
>> > > > final request.
>> > >
>> > > Yeah I think the interaction between OA sampling and GEM is the critical
>> > > piece here for both patch series. Step one is to have a per-pmu lock to
>> > > keep track of data private to OA and mmio based sampling as a starting
>> > > point. Then we need to figure out how to structure the connection without
>> > > OA/PMU and gem trampling over each other's feet.
>> > >
>> > > Wrt requests those are refcounted already and just waiting on them doesn't
>> > > need dev->struct_mutex. That should be all you need, assuming you do
>> > > correctly refcount them ...
>> > > -Daniel
>> >
>> > Hi Daniel/Chris,
>> >
>> > Thanks for your feedback. I acknowledge the issue with increasing
>> > coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> > of data private to OA in my next version of patches.
>>
>> If you create new locking schemes please coordinate with Rob about them.
>> Better if we come up with a consistent locking scheme for all of OA. That
>> ofc doesn't exclude that we'll have per-pmu locking, just that the locking
>> design should match if it makes sense. The example I'm thinking of is
>> drrs&psr which both use the frontbuffer tracking code and both have their
>> private mutex. But the overall approach to locking and cleaning up the
>> async work is the same.
>>
> Ok, I'll coordinate with Robert about a consistent locking scheme here.
>
>> > W.r.t. having the synchronous wait during event stop, I thought through
>> > the method of having active reference on output buffer. This will let us
>> > have a delayed buffer destroy (controlled by decrement of output buffer
>> > refcount as and when gpu requests complete). This will be decoupled from
>> > the event stop here. But that is not the only thing at play here.
>> >
>> > The event stop is expected to disable the OA unit (which it is doing
>> > right now). In my usecase, I can't disable OA unit, till the time GPU
>> > processes the MI_RPC requests scheduled already, which otherwise may
>> > lead to hangs.
>> > So, I'm waiting synchronously for requests to complete before disabling
>> > OA unit.
>> >
>> > Also, the event_destroy callback should be destroying the buffers
>> > associated with event. (Right now, in current design, event_destroy
>> > callback waits for the above operations to be completed before
>> > proceeding to destroy the buffer).
>> >
>> > If I try to create a function which destroys the output buffer according
>> > to all references being released, all these operations will have to be
>> > performed together in that function, in this serial order (so that when
>> > refcount drops to 0, i.e. requests have completed, OA unit will be
>> > disabled, and then the buffer destroyed). This would lead to these
>> > operations being decoupled from the event_stop and event_destroy
>> > functions. This may be non-intentional behaviour w.r.t. these callbacks
>> > from userspace perspective.
>> >
>> > Robert, any thoughts?
>>
>> Yeah, no idea whether those perf callbacks are allowed to stall for a
>> longer time. If not, we could simply push the cleanup/OA disabling into an
>> async work.
>> -Daniel
>
> Hi Daniel,
> Actually right now, the stop_event callback is not stalled. Instead I'm
> scheduling an async work fn to do the job. The event destroy is then
> waiting for the work fn to finish processing, before proceeding on to
> destroy the buffer.
> I'm also thinking through Chris' suggestions here and will try to come
> up with a solution based on them.

A notable detail here is that most pmu methods are called in atomic
context by events/core.c via inter-processor interrupts (as a somewhat
unintended side effect of userspace opening i915_oa events as
specific-cpu events, perf will make sure all methods are invoked on
that specific cpu). This means waiting in pmu->event_stop() isn't even
an option for us, though as noted it's only the destroy currently that
waits for the completion of the stop work fn. event_init() and
event->destroy() are two exceptions that are called in process
context. That said though, it does seem worth considering whether we can
avoid waiting in event_destroy() if not strictly necessary, and
Chris's suggestion to follow a refcounting scheme sounds workable.

Regards,
- Robert

>
> Thanks,
> Sourab
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  6:02       ` Gupta, Sourab
  2015-06-25  7:42         ` Daniel Vetter
  2015-06-25  8:02         ` Chris Wilson
@ 2015-06-25 13:02         ` Robert Bragg
  2015-06-25 13:07           ` Robert Bragg
  2 siblings, 1 reply; 23+ messages in thread
From: Robert Bragg @ 2015-06-25 13:02 UTC (permalink / raw)
  To: Gupta, Sourab
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

On Thu, Jun 25, 2015 at 7:02 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > >
>> > > To collect timestamps around any GPU workload, we need to insert
>> > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > call, we need to wait for GPU to complete processing the last request for
>> > > which these commands were inserted.
>> > > We need to ensure this processing is done before event_destroy callback which
>> > > deallocates the buffer for holding the data.
>> >
>> > There's no reason for this to be synchronous. Just that you need an
>> > active reference on the output buffer to be only released after the
>> > final request.
>>
>> Yeah I think the interaction between OA sampling and GEM is the critical
>> piece here for both patch series. Step one is to have a per-pmu lock to
>> keep track of data private to OA and mmio based sampling as a starting
>> point. Then we need to figure out how to structure the connection without
>> OA/PMU and gem trampling over each other's feet.

From the initial i915_oa code I think the notable locks are:

the struct perf_event_context::lock spin lock taken by events/core.c
before calling any of our pmu methods (with the exception of
event_init and event->destroy()) so we at least don't need to worry
about pmu methods trampling each other.

I think we should only take struct_mutex to meet the existing
assumptions of gem code we're using. Since most pmu methods are in
interrupt context our calling into gem is generally constrained to
event_init or event->destroy() without having to use workers.

we have a pmu spin lock to protect OA register state, notably because
besides the pmu methods we also have hooks into gem context switching
or pinning/unpinning that may trigger updates to the OACONTROL state
vs e.g. disabling OACONTROL in pmu->stop(), which may run concurrently on
different cpus.

we have an oa_buffer.flush_lock spin lock since, in addition to a
hrtimer periodically forwarding samples from the OABUFFER to perf,
userspace may also explicitly request a flush via fsync() or we might
flush at pmu->stop(). As a further detail here, if the hrtimer
contends with a userspace flush the hrtimer won't ever actually wait
on flush_lock, on the assumption that a back-to-back flush isn't
necessary when the buffer will probably be empty anyway.

I have to admit I haven't really scrutinized the locking details of
Sourab's work yet, but it may make sense to introduce some additional
exclusion scheme for managing access to the fifo of pending RPC
commands. I need to double check with Sourab, but I think we'd already
discussed some changes to how forwarding of these RPC based reports
will be managed where I thought we might be able to avoid the need for
a worker and struct_mutex locking while forwarding. I think Sourab may
have just preferred to send something out before this refactor to try
and solicit broader, high level feedback on his work sooner. I imagine
if we do refactor to avoid the worker for forwarding though that will
affect the locking details regarding the fifo structure.

Regards,
- Robert
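
As a concrete illustration of the "hrtimer never waits on flush_lock"
behaviour (field and helper names here are only loosely modelled on the
i915_oa RFC and should be treated as placeholders): if a userspace flush
already holds the lock, the periodic callback simply skips that interval
rather than spinning:

static enum hrtimer_restart oa_poll_check_timer_cb(struct hrtimer *hrtimer)
{
	struct drm_i915_private *dev_priv =
		container_of(hrtimer, typeof(*dev_priv), oa_pmu.timer);

	if (spin_trylock(&dev_priv->oa_pmu.oa_buffer.flush_lock)) {
		flush_oa_snapshots(dev_priv);	/* placeholder forwarding helper */
		spin_unlock(&dev_priv->oa_pmu.oa_buffer.flush_lock);
	}

	hrtimer_forward_now(hrtimer, ns_to_ktime(1000000)); /* ~1ms poll period */
	return HRTIMER_RESTART;
}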

>>
>> Wrt requests those are refcounted already and just waiting on them doesn't
>> need dev->struct_mutex. That should be all you need, assuming you do
>> correctly refcount them ...
>> -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.
>
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
>
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.
>
> Robert, any thoughts?
>

I think a few pertinent details here that inform how we need to handle this are:

1) Almost all the pmu methods are called in atomic context (except
event_init) as they are invoked via events/core.c via an
inter-processor interrupt so waiting for a completion in event_destroy
isn't really an option for us.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25 13:02         ` Robert Bragg
@ 2015-06-25 13:07           ` Robert Bragg
  0 siblings, 0 replies; 23+ messages in thread
From: Robert Bragg @ 2015-06-25 13:07 UTC (permalink / raw)
  To: Gupta, Sourab
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Woo, Insoo, Bragg, Robert

>> Robert, any thoughts?
>>
>
> I think a few pertinent details here that inform how we need to handle this are:
>
> 1) Almost all the pmu methods are called in atomic context (except
> event_init) as they are invoked via events/core.c via an
> inter-processor interrupt so waiting for a completion in event_destroy
> isn't really an option for us.

Oops meant to delete this incomplete comment. Before double checking
events/core.c I was worried that event->destroy() might be called in
interrupt context, but that's not the case.

- Robert

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25  8:02         ` Chris Wilson
@ 2015-06-25 17:31           ` Robert Bragg
  2015-06-25 17:37             ` Chris Wilson
  0 siblings, 1 reply; 23+ messages in thread
From: Robert Bragg @ 2015-06-25 17:31 UTC (permalink / raw)
  To: Chris Wilson, Gupta, Sourab, Daniel Vetter, intel-gfx, Woo,
	Insoo, Peter Zijlstra, Wu, Jabin, Bragg, Robert, Robert Bragg

On Thu, Jun 25, 2015 at 9:02 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > >
>> > > > To collect timestamps around any GPU workload, we need to insert
>> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > > call, we need to wait for GPU to complete processing the last request for
>> > > > which these commands were inserted.
>> > > > We need to ensure this processing is done before event_destroy callback which
>> > > > deallocates the buffer for holding the data.
>> > >
>> > > There's no reason for this to be synchronous. Just that you need an
>> > > active reference on the output buffer to be only released after the
>> > > final request.
>> >
>> > Yeah I think the interaction between OA sampling and GEM is the critical
>> > piece here for both patch series. Step one is to have a per-pmu lock to
>> > keep track of data private to OA and mmio based sampling as a starting
>> > point. Then we need to figure out how to structure the connection without
>> > OA/PMU and gem trampling over each other's feet.
>> >
>> > Wrt requests those are refcounted already and just waiting on them doesn't
>> > need dev->struct_mutex. That should be all you need, assuming you do
>> > correctly refcount them ...
>> > -Daniel
>>
>> Hi Daniel/Chris,
>>
>> Thanks for your feedback. I acknowledge the issue with increasing
>> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> of data private to OA in my next version of patches.
>>
>> W.r.t. having the synchronous wait during event stop, I thought through
>> the method of having active reference on output buffer. This will let us
>> have a delayed buffer destroy (controlled by decrement of output buffer
>> refcount as and when gpu requests complete). This will be decoupled from
>> the event stop here. But that is not the only thing at play here.
>>
>> The event stop is expected to disable the OA unit (which it is doing
>> right now). In my usecase, I can't disable OA unit, till the time GPU
>> processes the MI_RPC requests scheduled already, which otherwise may
>> lead to hangs.
>> So, I'm waiting synchronously for requests to complete before disabling
>> OA unit.
>
> There's no issue here. Just hook into the retire-notification callback
> for the last active request of the oa buffer. If you are using LRI to
> disable the OA Context writing to the buffer, you can simply create a
> request for the LRI and use the active reference as the retire
> notification. (Otoh, if the oa buffer is inactive you can simply do the
> decoupling immediately.) You shouldn't need an object-free-notification
> callback aiui.
>
>> Also, the event_destroy callback should be destroying the buffers
>> associated with event. (Right now, in current design, event_destroy
>> callback waits for the above operations to be completed before
>> proceeding to destroy the buffer).
>>
>> If I try to create a function which destroys the output buffer according
>> to all references being released, all these operations will have to be
>> performed together in that function, in this serial order (so that when
>> refcount drops to 0, i.e. requests have completed, OA unit will be
>> disabled, and then the buffer destroyed). This would lead to these
>> operations being decoupled from the event_stop and event_destroy
>> functions. This may be non-intentional behaviour w.r.t. these callbacks
>> from userspace perspective.
>
> When the perf event is stopped, you stop generating perf events. But the
> buffer may still be alive, and may be resurrected if a new event is
> started, but you will want to ensure that pending oa writes
> are ignored.
> -Chris

Just to summarise some things I know Sourab discussed with Chris on
IRC and I then also discussed with Sourab offline.

It does look like we can avoid any waiting in event->destroy(),
decoupling tracking of outstanding RPC commands+oacontrol disable from
the event active status. There's no strict reason OACONTROL needs to
be disabled when stopping or destroying an event if there are still
pending RPC commands; we just need to stop forwarding samples while
disabled and discard reports that land after an event is destroyed. We
can also update the hw.state and active state before OACONTROL is
disabled, whereas currently the code updates this state outside of
core perf's context lock, which isn't good, and we might also end up
continuing to forward periodic samples from the oabuffer to perf when
the user expects the event to be disabled.

There was a discussion of updating oacontrol via LRIs, but my concern
here is with wanting to update oacontrol state for periodic sampling
use cases where it would seem awkward to handle that via LRIs.
Enabling/disabling periodic sampling can be considered orthogonal from
fully disabling oacontrol and it will be useful to use periodic
sampling in conjunction with RPC sampling. When we stop an event we
should be able to immediately stop periodic sampling even if we must
leave OA enabled for a pending RPC command. If the event is started
again we can quickly re-enable periodic sampling but it might be
awkward to consider an in-flight LRI we know is going to disable
oacontrol soon. Related to using LRIs though was the idea of using the
LRI request to help manage the lifetime of the destination buffer, by
adding the vmas of the dest buffer to the request's active list. It
seems like we should be able to do this kind of thing with the
requests associated with the MI_RPC commands instead.

Thinking about the details of waiting for the last RPC command before
destroying the dest buffer and disabling OACONTROL, these are the
requirements I see:
- we want to free the dest buffer in a finite time, since it's large
(i.e. don't want to assume it's ok to keep around if allocated once)
- we must wait for the last RPC commands to complete before disabling
oacontrol (and can free the dest buffer at the same point)
- in any subsequent event_init() we need to be able to verify we don't
/still/ have pending RPC commands, because an incompatible oacontrol
requirement at this point is a corner case where we really do have to
block and wait for those RPC commands to complete or say return
-EBUSY.

Without the retire-notification api that Chris has been experimenting
with (that could provide us a convenient callback when our RPC
commands retire), the current plan is to still schedule some work in
event->destroy() to wait for the last outstanding RPC command to
retire before disabling oacontrol as well as free the RPC destination
buffer, but notably there's no need to block on its completion. I
imagine this could later easily be adapted to work with a
retire-notification api, and maybe Sourab could even experiment with
Chris's wip code for this to compare.

In the end, I think the only time we'll really need to block waiting
for outstanding requests is in event_init() in cases where there are
still pending RPC commands and the previous oacontrol report-format is
smaller than the newly requested format (which I guess should basically
never happen in practice, since most users will be requesting the
largest report format... and technically I suppose there are even lots
of cases where you could safely allow the report size to increase if
you know there's adequate space in the old dest buffer; though
probably not worth fussing about.).


Besides avoiding blocking in event->destroy(), I'd been thinking we
wouldn't need the worker for forwarding the RPC reports to perf and
could instead just use a hrtimer like we do for periodic samples.
Talking this through with Sourab though, he tried this, and it looks
like it would be awkward due to needing to drop the references on
request structs, which may in turn need to take struct_mutex to finally
free.

In terms of locking while forwarding samples and in the insert_cmd
hooks, for the fifo structures tracking pending commands we should
stop using struct_mutex (except if necessary for calling into gem) and
we should also avoid locking around all of the forwarding work with
any mutex that (if anything) just needs to protect updates to the fifo
itself. I guess there's some reasonable lock-free way to handle adding
and removing from these fifos, but haven't considered that carefully
and don't have a strong opinion on particular details here.


Ok I think that pretty much summarises what was discussed offline,

Regards,
- Robert
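
To make the ordering concrete, a sketch of that deferred flow (every name
below is a placeholder; this is not code from the series): event->destroy()
only queues work, and the work item waits for the last outstanding MI_RPC
request to retire before disabling OA and freeing the destination buffer:

static void gen_pmu_destroy_work_fn(struct work_struct *work)
{
	struct drm_i915_private *dev_priv =
		container_of(work, typeof(*dev_priv), gen_pmu.work_destroy);

	gen_pmu_wait_last_rpc_request(dev_priv);	/* placeholder: wait/retire-notify */
	gen_pmu_disable_oa(dev_priv);			/* placeholder */
	gen_pmu_free_dest_buffer(dev_priv);		/* placeholder */
}

static void i915_gen_event_destroy(struct perf_event *event)
{
	struct drm_i915_private *dev_priv =
		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);

	/* process context, but deliberately does not block */
	schedule_work(&dev_priv->gen_pmu.work_destroy);
}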

>
> --
> Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25 17:31           ` Robert Bragg
@ 2015-06-25 17:37             ` Chris Wilson
  2015-06-25 18:20               ` Chris Wilson
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Wilson @ 2015-06-25 17:37 UTC (permalink / raw)
  To: Robert Bragg
  Cc: Peter Zijlstra, intel-gfx, Wu, Jabin, Gupta, Sourab, Woo, Insoo,
	Bragg, Robert

On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> Thinking about the details of waiting for the last RPC command before
> destroying the dest buffer and disabling OACONTROL these are the
> requirements I see:
> - we want to free the dest buffer in a finite time,
> (i.e. don't want to assume it's ok to keep around if allocated once)

Skipping to this point, no you don't. If you mark the buffer as
purgeable after you stop tracking it (though there may still be pending
writes from the GPU), the system will preferentially reuse the buffer's
backing storage when memory is tight (including waiting for the GPU).
Similarly, you need to unpin it as soon as possible so that the address
space is available for reuse asap.
-Chris
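
A minimal sketch of that, assuming it runs from process context (e.g. the
deferred cleanup work) where struct_mutex can be taken; the gen_pmu.buffer
field follows the RFC, the rest is illustrative:

static void gen_pmu_release_dest_buffer(struct drm_i915_private *dev_priv)
{
	struct drm_i915_gem_object *obj = dev_priv->gen_pmu.buffer.obj;
	struct drm_device *dev = dev_priv->dev;

	mutex_lock(&dev->struct_mutex);
	i915_gem_object_ggtt_unpin(obj);	/* give the GGTT address space back asap */
	obj->madv = I915_MADV_DONTNEED;		/* backing pages reclaimable under pressure */
	drm_gem_object_unreference(&obj->base);	/* active references keep it until the GPU is done */
	mutex_unlock(&dev->struct_mutex);

	dev_priv->gen_pmu.buffer.obj = NULL;
}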

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
  2015-06-25 17:37             ` Chris Wilson
@ 2015-06-25 18:20               ` Chris Wilson
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2015-06-25 18:20 UTC (permalink / raw)
  To: Robert Bragg, Gupta, Sourab, Daniel Vetter, intel-gfx, Woo,
	Insoo, Peter Zijlstra, Wu, Jabin, Bragg, Robert

On Thu, Jun 25, 2015 at 06:37:18PM +0100, Chris Wilson wrote:
> On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> > Thinking about the details of waiting for the last RPC command before
> > destroying the dest buffer and disabling OACONTROL these are the
> > requirements I see:
> > - we want to free the dest buffer in a finite time,
> > (i.e. don't want to assume it's ok to keep around if allocated once)
> 
> Skipping to this point, no you don't. If you mark the buffer as
> purgeable after you stop tracking it (though there may still be pending
> writes from the GPU), the system will preferentially reuse the buffer's
> backing storage when memory is tight (including waiting for the GPU).
> Similarly, you need to unpin it as soon as possible so that the address
> space is available for reuse asap.

Or perhaps more to the point, you cannot free the buffer before the GPU
is finished. It doesn't matter if you want to free it synchronously; it
cannot happen before it would have been freed by dropping the active
reference anyway.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages
2015-06-22  9:55 [RFC 0/7] Introduce framework for forwarding generic non-OA performance sourab.gupta
2015-06-22  9:55 ` [RFC 1/7] drm/i915: Add a new PMU for handling non-OA counter data profiling requests sourab.gupta
2015-06-22  9:55 ` [RFC 2/7] drm/i915: Register routines for Gen perf PMU driver sourab.gupta
2015-06-22  9:55 ` [RFC 3/7] drm/i915: Introduce timestamp node for timestamp data collection sourab.gupta
2015-06-22  9:55 ` [RFC 4/7] drm/i915: Add mechanism for forwarding the data samples to userspace through Gen PMU perf interface sourab.gupta
2015-06-22 13:21   ` Chris Wilson
2015-06-22  9:55 ` [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU sourab.gupta
2015-06-22 13:22   ` Chris Wilson
2015-06-22 16:09     ` Daniel Vetter
2015-06-25  6:02       ` Gupta, Sourab
2015-06-25  7:42         ` Daniel Vetter
2015-06-25  8:27           ` Gupta, Sourab
2015-06-25 11:47             ` Robert Bragg
2015-06-25  8:02         ` Chris Wilson
2015-06-25 17:31           ` Robert Bragg
2015-06-25 17:37             ` Chris Wilson
2015-06-25 18:20               ` Chris Wilson
2015-06-25 13:02         ` Robert Bragg
2015-06-25 13:07           ` Robert Bragg
2015-06-22  9:55 ` [RFC 6/7] drm/i915: Add routines for inserting commands in the ringbuf for capturing timestamps sourab.gupta
2015-06-22  9:55 ` [RFC 7/7] drm/i915: Add support for retrieving MMIO register values in Gen Perf PMU sourab.gupta
2015-06-22 13:29   ` Chris Wilson
2015-06-22 16:06   ` Daniel Vetter
