* [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Robert Bragg @ 2014-10-22 15:28 UTC
To: linux-kernel
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs, Robert Bragg

Although I haven't seen any precedent for drivers using perf pmus to expose
device metrics, I've recently been experimenting with exposing some of the
performance counters of Intel Gen graphics hardware, looking to see if it
makes sense to build on the perf infrastructure for our use cases.

I've got a basic pmu driver working to expose our Observation Architecture
counters, and it seems like a fairly good fit. The main caveat is that, to
allow the permission model we would like, I needed to make some changes in
events/core which I'd really appreciate some feedback on...

In this case we're using the driver to support some performance monitoring
extensions in Mesa (AMD_performance_monitor + INTEL_performance_query), and
we don't want to require OpenGL clients to run as root to be able to monitor
a gpu context they own.

Our desired permission model seems consistent with perf's current model,
whereby you need privileges to profile across all gpu contexts but no special
permissions to profile your own context.

The awkward part is that it doesn't make sense for us to have userspace open
a perf event with a specific pid as the way to avoid needing root
permissions: a side effect of doing so is that the events are dynamically
added/deleted so as to only monitor while that process is scheduled on a cpu,
which isn't really meaningful when we're monitoring the gpu.
Conceptually I suppose we want to be able to open an event that's not
associated with any cpu or process, but to keep things simple and fit with
perf's current design, the pmu I have at the moment expects an event to be
opened for a specific cpu and an unspecified process. To then subvert the cpu
centric permission checks, I added a PERF_PMU_CAP_IS_DEVICE capability that a
pmu driver can use to tell events/core that the pmu doesn't collect any cpu
metrics, so perf can skip its usual checks and assume the driver will
implement its own checks as appropriate.

In addition I also explicitly blacklist numerous attributes and PERF_SAMPLE_
flags that I don't think make sense for a device pmu. This could be handled
in the pmu driver, but it seemed better to do in events/core, avoiding
duplication in case we later have multiple device pmus.

I'd be interested to hear whether it sounds reasonable to others for us to
expose gpu device metrics via a perf pmu, and whether adding the
PERF_PMU_CAP_IS_DEVICE flag as in my following patch could be acceptable.

Patches:

[RFC PATCH 1/3] perf: export perf_event_overflow

[RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag

    The main change to core/events I'd really appreciate feedback on.

[RFC PATCH 3/3] i915: Expose PMU for Observation Architecture

    My current pmu driver, provided for context (work in progress). Early,
    high-level feedback would be appreciated, though I think it could be good
    to focus on the core/events change first. I also plan to send this to the
    intel-gfx list for review.

Essentially, this pmu allows us to configure the gpu to periodically write
snapshots of performance counters (up to 64 32bit counters per snapshot) into
a circular buffer. It then uses a 200Hz hrtimer to forward those snapshots to
userspace as perf samples, with the counter snapshots written by the gpu
attached as 'raw' data.
If anyone is interested in more details about Haswell's gpu performance
counters, the PRM can be found here:

https://01.org/linuxgraphics/sites/default/files/documentation/observability_performance_counters_haswell.pdf

To see how I'm currently using this from userspace, I have a couple of
intel-gpu-tools utilities; intel_oacounter_top_pmu + intel_gpu_trace_pmu:

https://github.com/rib/intel-gpu-tools/commits/wip/rib/intel-i915-oa-pmu

And the current code I have to use this in Mesa is here:

https://github.com/rib/mesa/commits/wip/rib/i915_oa_perf

Regards,
- Robert

 drivers/gpu/drm/i915/Makefile       |   1 +
 drivers/gpu/drm/i915/i915_dma.c     |   2 +
 drivers/gpu/drm/i915/i915_drv.h     |  33 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 675 ++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h     |  87 +++++
 include/linux/perf_event.h          |   1 +
 include/uapi/drm/i915_drm.h         |  21 ++
 kernel/events/core.c                |  40 ++-
 8 files changed, 854 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c

-- 
2.1.2
* [RFC PATCH 1/3] perf: export perf_event_overflow
From: Robert Bragg @ 2014-10-22 15:28 UTC
To: linux-kernel
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs, Robert Bragg

To support pmu drivers in loadable modules, such as the i915 driver

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 kernel/events/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cf24b3..9449180 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5478,6 +5478,7 @@ int perf_event_overflow(struct perf_event *event,
 {
 	return __perf_event_overflow(event, 1, data, regs);
 }
+EXPORT_SYMBOL_GPL(perf_event_overflow);
 
 /*
  * Generic software event infrastructure
-- 
2.1.2
* [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag
From: Robert Bragg @ 2014-10-22 15:28 UTC
To: linux-kernel
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs, Robert Bragg

The PERF_PMU_CAP_IS_DEVICE flag provides a way for pmu drivers to declare
that they only monitor device specific metrics. Since such pmus don't monitor
any cpu metrics, perf should bypass its cpu centric security checks for them,
as well as disallow cpu centric attributes.

Signed-off-by: Robert Bragg <robert@sixbynine.org>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 39 +++++++++++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..e1e0153 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -170,6 +170,7 @@ struct perf_event;
  * pmu::capabilities flags
  */
 #define PERF_PMU_CAP_NO_INTERRUPT		0x01
+#define PERF_PMU_CAP_IS_DEVICE			0x02
 
 /**
  * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9449180..3ddb157 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3131,7 +3131,8 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
-		if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+		if (!(pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) &&
+		    perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
 			return ERR_PTR(-EACCES);
 
 		/*
@@ -7091,11 +7092,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
 
-	if (!attr.exclude_kernel) {
-		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-			return -EACCES;
-	}
-
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
@@ -7154,6 +7150,37 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_cpus;
 	}
 
+	if (event->pmu->capabilities & PERF_PMU_CAP_IS_DEVICE) {
+
+		/* Don't allow cpu centric attributes... */
+		if (event->attr.exclude_user ||
+		    event->attr.exclude_callchain_user ||
+		    event->attr.exclude_kernel ||
+		    event->attr.exclude_callchain_kernel ||
+		    event->attr.exclude_hv ||
+		    event->attr.exclude_idle ||
+		    event->attr.exclude_host ||
+		    event->attr.exclude_guest ||
+		    event->attr.mmap ||
+		    event->attr.comm ||
+		    event->attr.task)
+			return -EINVAL;
+
+		if (attr.sample_type &
+		    (PERF_SAMPLE_IP |
+		     PERF_SAMPLE_TID |
+		     PERF_SAMPLE_ADDR |
+		     PERF_SAMPLE_CALLCHAIN |
+		     PERF_SAMPLE_CPU |
+		     PERF_SAMPLE_BRANCH_STACK |
+		     PERF_SAMPLE_REGS_USER |
+		     PERF_SAMPLE_STACK_USER))
+			return -EINVAL;
+	} else if (!attr.exclude_kernel) {
+		if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+	}
+
 	if (flags & PERF_FLAG_PID_CGROUP) {
 		err = perf_cgroup_connect(pid, event, &attr, group_leader);
 		if (err) {
-- 
2.1.2
* [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
From: Robert Bragg @ 2014-10-22 15:28 UTC
To: linux-kernel
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs, Robert Bragg

Gen graphics hardware can be set up to periodically write snapshots of
performance counters into a circular buffer, and this patch exposes that
capability to userspace via the perf interface. Only Haswell is supported
currently.
Signed-off-by: Robert Bragg <robert@sixbynine.org> --- drivers/gpu/drm/i915/Makefile | 1 + drivers/gpu/drm/i915/i915_dma.c | 2 + drivers/gpu/drm/i915/i915_drv.h | 33 ++ drivers/gpu/drm/i915/i915_oa_perf.c | 675 ++++++++++++++++++++++++++++++++++++ drivers/gpu/drm/i915/i915_reg.h | 87 +++++ include/uapi/drm/i915_drm.h | 21 ++ 6 files changed, 819 insertions(+) create mode 100644 drivers/gpu/drm/i915/i915_oa_perf.c diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile index c1dd485..2ddd97d 100644 --- a/drivers/gpu/drm/i915/Makefile +++ b/drivers/gpu/drm/i915/Makefile @@ -14,6 +14,7 @@ i915-y := i915_drv.o \ intel_pm.o i915-$(CONFIG_COMPAT) += i915_ioc32.o i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o +i915-$(CONFIG_PERF_EVENTS) += i915_oa_perf.o # GEM code i915-y += i915_cmd_parser.o \ diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c index 3f676f9..ce1e1ea 100644 --- a/drivers/gpu/drm/i915/i915_dma.c +++ b/drivers/gpu/drm/i915/i915_dma.c @@ -1792,6 +1792,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags) intel_gpu_ips_init(dev_priv); intel_init_runtime_pm(dev_priv); + i915_oa_pmu_register(dev); return 0; @@ -1839,6 +1840,7 @@ int i915_driver_unload(struct drm_device *dev) return ret; } + i915_oa_pmu_unregister(dev); intel_fini_runtime_pm(dev_priv); intel_gpu_ips_teardown(); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 6fbd316..1b2c557 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -45,6 +45,7 @@ #include <linux/hashtable.h> #include <linux/intel-iommu.h> #include <linux/kref.h> +#include <linux/perf_event.h> #include <linux/pm_qos.h> /* General customization: @@ -1636,6 +1637,29 @@ struct drm_i915_private { */ struct workqueue_struct *dp_wq; +#ifdef CONFIG_PERF_EVENTS + struct { + struct pmu pmu; + spinlock_t lock; + struct hrtimer timer; + struct pt_regs dummy_regs; + + struct perf_event *exclusive_event; + struct 
intel_context *specific_ctx; + + struct { + struct kref refcount; + struct drm_i915_gem_object *obj; + u32 gtt_offset; + u8 *addr; + u32 head; + u32 tail; + int format; + int format_size; + } oa_buffer; + } oa_pmu; +#endif + /* Old dri1 support infrastructure, beware the dragons ya fools entering * here! */ struct i915_dri1_state dri1; @@ -2688,6 +2712,15 @@ int i915_parse_cmds(struct intel_engine_cs *ring, u32 batch_start_offset, bool is_master); +/* i915_oa_perf.c */ +#ifdef CONFIG_PERF_EVENTS +extern void i915_oa_pmu_register(struct drm_device *dev); +extern void i915_oa_pmu_unregister(struct drm_device *dev); +#else +static inline void i915_oa_pmu_register(struct drm_device *dev) {} +static inline void i915_oa_pmu_unregister(struct drm_device *dev) {} +#endif + /* i915_suspend.c */ extern int i915_save_state(struct drm_device *dev); extern int i915_restore_state(struct drm_device *dev); diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c new file mode 100644 index 0000000..d86aaf0 --- /dev/null +++ b/drivers/gpu/drm/i915/i915_oa_perf.c @@ -0,0 +1,675 @@ +#include <linux/perf_event.h> +#include <linux/sizes.h> + +#include "i915_drv.h" +#include "intel_ringbuffer.h" + +/* Must be a power of two */ +#define OA_BUFFER_SIZE SZ_16M +#define OA_TAKEN(tail, head) ((tail - head) & (OA_BUFFER_SIZE - 1)) + +#define FREQUENCY 200 +#define PERIOD max_t(u64, 10000, NSEC_PER_SEC / FREQUENCY) + +static int hsw_perf_format_sizes[] = { + 64, /* A13_HSW */ + 128, /* A29_HSW */ + 128, /* A13_B8_C8_HSW */ + + /* XXX: If we were to disallow this format we could avoid needing to + * handle snapshots being split in two when they don't factor into + * the buffer size... 
*/ + 192, /* A29_B8_C8_HSW */ + 64, /* B4_C8_HSW */ + 256, /* A45_B8_C8_HSW */ + 128, /* B4_C8_A16_HSW */ + 64 /* C4_B8_HSW */ +}; + +static void forward_one_oa_snapshot_to_event(struct drm_i915_private *dev_priv, + u8 *snapshot, + struct perf_event *event) +{ + struct perf_sample_data data; + int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size; + struct perf_raw_record raw; + + perf_sample_data_init(&data, 0, event->hw.last_period); + + /* XXX: It seems strange that kernel/events/core.c only initialises + * data->type if event->attr.sample_id_all is set + * + * For now, we explicitly set this otherwise perf_event_overflow() + * may reference an uninitialised sample_type and may not actually + * forward our raw data. + */ + data.type = event->attr.sample_type; + + /* Note: the 32 bit size + raw data must be 8 byte aligned. + * + * So that we don't have to first copy the data out of the + * OABUFFER, we instead allow an overrun and forward the 32 bit + * report id of the next snapshot... + */ + raw.size = snapshot_size + 4; + raw.data = snapshot; + + data.raw = &raw; + + perf_event_overflow(event, &data, &dev_priv->oa_pmu.dummy_regs); +} + +static u32 forward_oa_snapshots(struct drm_i915_private *dev_priv, + u32 head, + u32 tail) +{ + struct perf_event *exclusive_event = dev_priv->oa_pmu.exclusive_event; + int snapshot_size = dev_priv->oa_pmu.oa_buffer.format_size; + u8 *oa_buf_base = dev_priv->oa_pmu.oa_buffer.addr; + u32 mask = (OA_BUFFER_SIZE - 1); + u8 scratch[snapshot_size + 4]; + u8 *snapshot; + u32 taken; + + head -= dev_priv->oa_pmu.oa_buffer.gtt_offset; + tail -= dev_priv->oa_pmu.oa_buffer.gtt_offset; + + /* Note: the gpu doesn't wrap the tail according to the OA buffer size + * so when we need to make sure our head/tail values are in-bounds we + * use the above mask. + */ + + while ((taken = OA_TAKEN(tail, head))) { + u32 before; + + /* The tail increases in 64 byte increments, not in + * format_size steps. 
*/ + if (taken < snapshot_size) + break; + + /* As well as handling snapshots that are split in two we also + * need to pad snapshots at the end of the oabuffer so that + * forward_one_oa_snapshot_to_event() can safely overrun by 4 + * bytes for alignment. */ + before = OA_BUFFER_SIZE - (head & mask); + if (before <= snapshot_size) { + u32 after = snapshot_size - before; + + memcpy(scratch, oa_buf_base + (head & mask), before); + if (after) + memcpy(scratch + before, oa_buf_base, after); + snapshot = scratch; + } else + snapshot = oa_buf_base + (head & mask); + + head += snapshot_size; + + /* We currently only allow exclusive access to the counters + * so only have one event to forward too... */ + if (exclusive_event->state == PERF_EVENT_STATE_ACTIVE) + forward_one_oa_snapshot_to_event(dev_priv, snapshot, + exclusive_event); + } + + return dev_priv->oa_pmu.oa_buffer.gtt_offset + head; +} + +static void flush_oa_snapshots(struct drm_i915_private *dev_priv, + bool force_wake) +{ + unsigned long flags; + u32 oastatus2; + u32 oastatus1; + u32 head; + u32 tail; + + /* Can either flush via hrtimer callback or pmu methods/fops */ + if (!force_wake) { + + /* If the hrtimer triggers at the same time that we are + * responding to a userspace initiated flush then we can + * just bail out... + * + * FIXME: strictly this lock doesn't imply we are already + * flushing though it shouldn't really be a problem to skip + * the odd hrtimer flush anyway. + */ + if (!spin_trylock_irqsave(&dev_priv->oa_pmu.lock, flags)) + return; + } else + spin_lock_irqsave(&dev_priv->oa_pmu.lock, flags); + + WARN_ON(!dev_priv->oa_pmu.oa_buffer.addr); + + oastatus2 = I915_READ(OASTATUS2); + oastatus1 = I915_READ(OASTATUS1); + + head = oastatus2 & OASTATUS2_HEAD_MASK; + tail = oastatus1 & OASTATUS1_TAIL_MASK; + + if (oastatus1 & (OASTATUS1_OABUFFER_OVERFLOW | + OASTATUS1_REPORT_LOST)) { + + /* XXX: How can we convey report-lost errors to userspace? 
It + * doesn't look like perf's _REPORT_LOST mechanism is + * appropriate in this case; that's just for cases where we + * run out of space for samples in the perf circular buffer. + * + * Maybe we can claim a special report-id and use that to + * forward status flags? + */ + pr_debug("OA buffer read error: addr = %p, head = %u, offset = %u, tail = %u cnt o'flow = %d, buf o'flow = %d, rpt lost = %d\n", + dev_priv->oa_pmu.oa_buffer.addr, + head, + head - dev_priv->oa_pmu.oa_buffer.gtt_offset, + tail, + oastatus1 & OASTATUS1_COUNTER_OVERFLOW ? 1 : 0, + oastatus1 & OASTATUS1_OABUFFER_OVERFLOW ? 1 : 0, + oastatus1 & OASTATUS1_REPORT_LOST ? 1 : 0); + + I915_WRITE(OASTATUS1, oastatus1 & + ~(OASTATUS1_OABUFFER_OVERFLOW | + OASTATUS1_REPORT_LOST)); + } + + head = forward_oa_snapshots(dev_priv, head, tail); + + I915_WRITE(OASTATUS2, (head & OASTATUS2_HEAD_MASK) | OASTATUS2_GGTT); + + spin_unlock_irqrestore(&dev_priv->oa_pmu.lock, flags); +} + +static void +oa_buffer_free(struct kref *kref) +{ + struct drm_i915_private *i915 = + container_of(kref, typeof(*i915), oa_pmu.oa_buffer.refcount); + + BUG_ON(!mutex_is_locked(&i915->dev->struct_mutex)); + + vunmap(i915->oa_pmu.oa_buffer.addr); + i915_gem_object_ggtt_unpin(i915->oa_pmu.oa_buffer.obj); + drm_gem_object_unreference(&i915->oa_pmu.oa_buffer.obj->base); + + i915->oa_pmu.oa_buffer.obj = NULL; + i915->oa_pmu.oa_buffer.gtt_offset = 0; + i915->oa_pmu.oa_buffer.addr = NULL; +} + +static inline void oa_buffer_reference(struct drm_i915_private *i915) +{ + kref_get(&i915->oa_pmu.oa_buffer.refcount); +} + +static void oa_buffer_unreference(struct drm_i915_private *i915) +{ + WARN_ON(!i915->oa_pmu.oa_buffer.obj); + + kref_put(&i915->oa_pmu.oa_buffer.refcount, oa_buffer_free); +} + +static void i915_oa_event_destroy(struct perf_event *event) +{ + struct drm_i915_private *i915 = + container_of(event->pmu, typeof(*i915), oa_pmu.pmu); + + WARN_ON(event->parent); + + mutex_lock(&i915->dev->struct_mutex); + + oa_buffer_unreference(i915); 
+ + if (i915->oa_pmu.specific_ctx) { + struct drm_i915_gem_object *obj; + + obj = i915->oa_pmu.specific_ctx->legacy_hw_ctx.rcs_state; + if (i915_gem_obj_is_pinned(obj)) + i915_gem_object_ggtt_unpin(obj); + i915->oa_pmu.specific_ctx = NULL; + } + + BUG_ON(i915->oa_pmu.exclusive_event != event); + i915->oa_pmu.exclusive_event = NULL; + + mutex_unlock(&i915->dev->struct_mutex); + + gen6_gt_force_wake_put(i915, FORCEWAKE_ALL); +} + +static void *vmap_oa_buffer(struct drm_i915_gem_object *obj) +{ + int i; + void *addr = NULL; + struct sg_page_iter sg_iter; + struct page **pages; + + pages = drm_malloc_ab(obj->base.size >> PAGE_SHIFT, sizeof(*pages)); + if (pages == NULL) { + DRM_DEBUG_DRIVER("Failed to get space for pages\n"); + goto finish; + } + + i = 0; + for_each_sg_page(obj->pages->sgl, &sg_iter, obj->pages->nents, 0) { + pages[i] = sg_page_iter_page(&sg_iter); + i++; + } + + addr = vmap(pages, i, 0, PAGE_KERNEL); + if (addr == NULL) { + DRM_DEBUG_DRIVER("Failed to vmap pages\n"); + goto finish; + } + +finish: + if (pages) + drm_free_large(pages); + return addr; +} + +static int init_oa_buffer(struct perf_event *event) +{ + struct drm_i915_private *dev_priv = + container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu); + struct drm_i915_gem_object *bo; + int ret; + + BUG_ON(!IS_HASWELL(dev_priv->dev)); + BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex)); + BUG_ON(dev_priv->oa_pmu.oa_buffer.obj); + + kref_init(&dev_priv->oa_pmu.oa_buffer.refcount); + + bo = i915_gem_alloc_object(dev_priv->dev, OA_BUFFER_SIZE); + if (bo == NULL) { + DRM_ERROR("Failed to allocate OA buffer\n"); + ret = -ENOMEM; + goto err; + } + dev_priv->oa_pmu.oa_buffer.obj = bo; + + ret = i915_gem_object_set_cache_level(bo, I915_CACHE_LLC); + if (ret) + goto err_unref; + + /* PreHSW required 512K alignment, HSW requires 16M */ + ret = i915_gem_obj_ggtt_pin(bo, SZ_16M, 0); + if (ret) + goto err_unref; + + dev_priv->oa_pmu.oa_buffer.gtt_offset = i915_gem_obj_ggtt_offset(bo); + 
dev_priv->oa_pmu.oa_buffer.addr = vmap_oa_buffer(bo); + + /* Pre-DevBDW: OABUFFER must be set with counters off, + * before OASTATUS1, but after OASTATUS2 */ + I915_WRITE(OASTATUS2, dev_priv->oa_pmu.oa_buffer.gtt_offset | + OASTATUS2_GGTT); /* head */ + I915_WRITE(GEN7_OABUFFER, dev_priv->oa_pmu.oa_buffer.gtt_offset); + I915_WRITE(OASTATUS1, dev_priv->oa_pmu.oa_buffer.gtt_offset | + OASTATUS1_OABUFFER_SIZE_16M); /* tail */ + + DRM_DEBUG_DRIVER("OA Buffer initialized, gtt offset = 0x%x, vaddr = %p", + dev_priv->oa_pmu.oa_buffer.gtt_offset, + dev_priv->oa_pmu.oa_buffer.addr); + + return 0; + +err_unref: + drm_gem_object_unreference_unlocked(&bo->base); +err: + return ret; +} + +static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer) +{ + struct drm_i915_private *i915 = + container_of(hrtimer, typeof(*i915), oa_pmu.timer); + + flush_oa_snapshots(i915, false); + + hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD)); + return HRTIMER_RESTART; +} + +static struct intel_context * +lookup_context(struct drm_i915_private *dev_priv, + struct file *user_filp, + u32 ctx_user_handle) +{ + struct intel_context *ctx; + + mutex_lock(&dev_priv->dev->struct_mutex); + list_for_each_entry(ctx, &dev_priv->context_list, link) { + struct drm_file *drm_file; + + if (!ctx->file_priv) + continue; + + drm_file = ctx->file_priv->file; + + if (user_filp->private_data == drm_file && + ctx->user_handle == ctx_user_handle) { + mutex_unlock(&dev_priv->dev->struct_mutex); + return ctx; + } + } + mutex_unlock(&dev_priv->dev->struct_mutex); + + return NULL; +} + +static int i915_oa_event_init(struct perf_event *event) +{ + struct perf_event_context *ctx = event->ctx; + struct drm_i915_private *dev_priv = + container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu); + int ret = 0; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* When tracing a specific pid events/core will enable/disable + * the event only while that pid is running on a cpu but that + * doesn't really make 
sense here. */ + if (ctx) { + if (ctx->task) + return -EINVAL; + } +#if 0 + else + pr_err("Unexpected NULL perf_event_context\n"); + + /* XXX: it looks like we get a NULL ctx, so check if setting + * pmu->task_ctx_nr to perf_invalid_context in _pmu_register + * implies events/core.c will also implicitly disallow + * associating a perf_oa event with a task? + */ +#endif + + /* To avoid the complexity of having to accurately filter + * counter snapshots and marshal to the appropriate client + * we currently only allow exclusive access */ + if (dev_priv->oa_pmu.oa_buffer.obj) + return -EBUSY; + + /* TODO: improve cooperation with the cmd_parser which provides + * another mechanism for enabling the OA counters. */ + if (I915_READ(OACONTROL) & OACONTROL_ENABLE) + return -EBUSY; + + /* Since we are limited to an exponential scale for + * programming the OA sampling period we don't allow userspace + * to pass a precise attr.sample_period. */ + if (event->attr.freq || + (event->attr.sample_period != 0 && + event->attr.sample_period != 1)) + return -EINVAL; + + /* Instead of allowing userspace to configure the period via + * attr.sample_period we instead accept an exponent whereby + * the sample_period will be: + * + * 80ns * 2^(period_exponent + 1) + * + * Programming a period of 160 nanoseconds would not be very + * polite, so higher frequencies are reserved for root. + */ + if (event->attr.sample_period) { + u64 period_exponent = + event->attr.config & I915_PERF_OA_TIMER_EXPONENT_MASK; + period_exponent >>= I915_PERF_OA_TIMER_EXPONENT_SHIFT; + + if (period_exponent < 15 && !capable(CAP_SYS_ADMIN)) + return -EACCES; + } + + if (!IS_HASWELL(dev_priv->dev)) + return -ENODEV; + + /* We bypass the default perf core perf_paranoid_cpu() || + * CAP_SYS_ADMIN check by using the PERF_PMU_CAP_IS_DEVICE + * flag and instead authenticate based on whether the current + * pid owns the specified context, or require CAP_SYS_ADMIN + * when collecting cross-context metrics. 
+ */ + dev_priv->oa_pmu.specific_ctx = NULL; + if (event->attr.config & I915_PERF_OA_SINGLE_CONTEXT_ENABLE) { + u32 ctx_id = event->attr.config & I915_PERF_OA_CTX_ID_MASK; + unsigned int drm_fd = event->attr.config1; + struct fd fd = fdget(drm_fd); + + if (fd.file) { + dev_priv->oa_pmu.specific_ctx = + lookup_context(dev_priv, fd.file, ctx_id); + } + } + + if (!dev_priv->oa_pmu.specific_ctx && !capable(CAP_SYS_ADMIN)) + return -EACCES; + + mutex_lock(&dev_priv->dev->struct_mutex); + + /* XXX: Not sure that this is really acceptable... + * + * i915_gem_context.c currently owns pinning/unpinning legacy + * context buffers and although that code has a + * get_context_alignment() func to handle a different + * constraint for gen6 we are assuming it's fixed for gen7 + * here. Another option besides pinning here would be to + * instead hook into context switching and update the + * OACONTROL configuration on the fly. + */ + if (dev_priv->oa_pmu.specific_ctx) { + struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx; + int ret; + + ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state, + 4096, 0); + if (ret) { + DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret); + ret = -EBUSY; + goto err; + } + } + + if (!dev_priv->oa_pmu.oa_buffer.obj) + ret = init_oa_buffer(event); + else + oa_buffer_reference(dev_priv); + + if (ret) + goto err; + + BUG_ON(dev_priv->oa_pmu.exclusive_event); + dev_priv->oa_pmu.exclusive_event = event; + + event->destroy = i915_oa_event_destroy; + + mutex_unlock(&dev_priv->dev->struct_mutex); + + /* PRM - observability performance counters: + * + * OACONTROL, performance counter enable, note: + * + * "When this bit is set, in order to have coherent counts, + * RC6 power state and trunk clock gating must be disabled. 
+ * This can be achieved by programming MMIO registers as + * 0xA094=0 and 0xA090[31]=1" + * + * 0xA094 corresponds to GEN6_RC_STATE + * 0xA090[31] corresponds to GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE + */ + /* XXX: We should probably find a more refined way of disabling RC6 + * in cooperation with intel_pm.c. + * TODO: Find a way to disable clock gating too + */ + gen6_gt_force_wake_get(dev_priv, FORCEWAKE_ALL); + + return 0; + +err: + mutex_unlock(&dev_priv->dev->struct_mutex); + + return ret; +} + +static void i915_oa_event_start(struct perf_event *event, int flags) +{ + struct drm_i915_private *dev_priv = + container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu); + u64 report_format; + int snapshot_size; + unsigned long ctx_id; + u64 period_exponent; + + /* PRM - observability performance counters: + * + * OACONTROL, specific context enable: + * + * "OA unit level clock gating must be ENABLED when using + * specific ContextID feature." + * + * Assuming we don't ever disable OA unit level clock gating + * lets just assert that this condition is met... + */ + WARN_ONCE(I915_READ(GEN6_UCGCTL3) & GEN6_OACSUNIT_CLOCK_GATE_DISABLE, + "disabled OA unit level clock gating will result in incorrect per-context OA counters"); + + /* XXX: On Haswell, when threshold disable mode is desired, + * instead of setting the threshold enable to '0', we need to + * program it to '1' and set OASTARTTRIG1 bits 15:0 to 0 + * (threshold value of 0) + */ + I915_WRITE(OASTARTTRIG6, (OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE | + OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE)); + I915_WRITE(OASTARTTRIG5, 0); /* threshold value */ + + I915_WRITE(OASTARTTRIG2, (OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE | + OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE)); + I915_WRITE(OASTARTTRIG1, 0); /* threshold value */ + + /* Setup B0 as the gpu clock counter... 
+ */
+	I915_WRITE(OACEC0_0, OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL); /* to 0 */
+	I915_WRITE(OACEC0_1, 0xfffe); /* Select NOA[0] */
+
+	period_exponent = event->attr.config & I915_PERF_OA_TIMER_EXPONENT_MASK;
+	period_exponent >>= I915_PERF_OA_TIMER_EXPONENT_SHIFT;
+
+	if (dev_priv->oa_pmu.specific_ctx) {
+		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
+
+		ctx_id = i915_gem_obj_ggtt_offset(ctx->legacy_hw_ctx.rcs_state);
+	} else
+		ctx_id = 0;
+
+	report_format = event->attr.config & I915_PERF_OA_FORMAT_MASK;
+	report_format >>= I915_PERF_OA_FORMAT_SHIFT;
+	snapshot_size = hsw_perf_format_sizes[report_format];
+
+	I915_WRITE(OACONTROL, 0 |
+		   (ctx_id & OACONTROL_CTX_MASK) |
+		   period_exponent << OACONTROL_TIMER_PERIOD_SHIFT |
+		   (event->attr.sample_period ? OACONTROL_TIMER_ENABLE : 0) |
+		   report_format << OACONTROL_FORMAT_SHIFT |
+		   (ctx_id ? OACONTROL_PER_CTX_ENABLE : 0) |
+		   OACONTROL_ENABLE);
+
+	if (event->attr.sample_period) {
+		__hrtimer_start_range_ns(&dev_priv->oa_pmu.timer,
+					 ns_to_ktime(PERIOD), 0,
+					 HRTIMER_MODE_REL_PINNED, 0);
+	}
+
+	dev_priv->oa_pmu.oa_buffer.format = report_format;
+	dev_priv->oa_pmu.oa_buffer.format_size = snapshot_size;
+
+	event->hw.state = 0;
+}
+
+static void i915_oa_event_stop(struct perf_event *event, int flags)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(event->pmu, typeof(*dev_priv), oa_pmu.pmu);
+
+	I915_WRITE(OACONTROL, I915_READ(OACONTROL) & ~OACONTROL_ENABLE);
+
+	if (event->attr.sample_period) {
+		hrtimer_cancel(&dev_priv->oa_pmu.timer);
+		flush_oa_snapshots(dev_priv, true);
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int i915_oa_event_add(struct perf_event *event, int flags)
+{
+	if (flags & PERF_EF_START)
+		i915_oa_event_start(event, flags);
+
+	return 0;
+}
+
+static void i915_oa_event_del(struct perf_event *event, int flags)
+{
+	i915_oa_event_stop(event, flags);
+}
+
+static void i915_oa_event_read(struct perf_event *event)
+{
+	struct drm_i915_private *i915 =
+		container_of(event->pmu, typeof(*i915), oa_pmu.pmu);
+
+	/* We want userspace to be able to use a read() to explicitly
+	 * flush OA counter snapshots... */
+	if (event->attr.sample_period)
+		flush_oa_snapshots(i915, true);
+
+	/* XXX: What counter would be useful here? */
+	local64_set(&event->count, 0);
+}
+
+static int i915_oa_event_event_idx(struct perf_event *event)
+{
+	return 0;
+}
+
+void i915_oa_pmu_register(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	/* We need to be careful about forwarding cpu metrics to
+	 * userspace considering that PERF_PMU_CAP_IS_DEVICE bypasses
+	 * the events/core security check that stops an unprivileged
+	 * process collecting metrics for other processes.
+	 */
+	i915->oa_pmu.dummy_regs = *task_pt_regs(current);
+
+	hrtimer_init(&i915->oa_pmu.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	i915->oa_pmu.timer.function = hrtimer_sample;
+
+	spin_lock_init(&i915->oa_pmu.lock);
+
+	i915->oa_pmu.pmu.capabilities = PERF_PMU_CAP_IS_DEVICE;
+	i915->oa_pmu.pmu.task_ctx_nr = perf_invalid_context;
+	i915->oa_pmu.pmu.event_init = i915_oa_event_init;
+	i915->oa_pmu.pmu.add = i915_oa_event_add;
+	i915->oa_pmu.pmu.del = i915_oa_event_del;
+	i915->oa_pmu.pmu.start = i915_oa_event_start;
+	i915->oa_pmu.pmu.stop = i915_oa_event_stop;
+	i915->oa_pmu.pmu.read = i915_oa_event_read;
+	i915->oa_pmu.pmu.event_idx = i915_oa_event_event_idx;
+
+	if (perf_pmu_register(&i915->oa_pmu.pmu, "i915_oa", -1))
+		i915->oa_pmu.pmu.event_init = NULL;
+}
+
+void i915_oa_pmu_unregister(struct drm_device *dev)
+{
+	struct drm_i915_private *i915 = to_i915(dev);
+
+	if (i915->oa_pmu.pmu.event_init == NULL)
+		return;
+
+	perf_pmu_unregister(&i915->oa_pmu.pmu);
+	i915->oa_pmu.pmu.event_init = NULL;
+}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 203062e..1e7cfd4 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -457,6 +457,92 @@
 #define GEN7_3DPRIM_BASE_VERTEX		0x2440
 
 #define OACONTROL 0x2360
+#define OACONTROL_CTX_MASK			0xFFFFF000
+#define OACONTROL_TIMER_PERIOD_MASK		0x3F
+#define OACONTROL_TIMER_PERIOD_SHIFT		6
+#define OACONTROL_TIMER_ENABLE			(1<<5)
+#define OACONTROL_FORMAT_A13_HSW		(0<<2)
+#define OACONTROL_FORMAT_A29_HSW		(1<<2)
+#define OACONTROL_FORMAT_A13_B8_C8_HSW		(2<<2)
+#define OACONTROL_FORMAT_A29_B8_C8_HSW		(3<<2)
+#define OACONTROL_FORMAT_B4_C8_HSW		(4<<2)
+#define OACONTROL_FORMAT_A45_B8_C8_HSW		(5<<2)
+#define OACONTROL_FORMAT_B4_C8_A16_HSW		(6<<2)
+#define OACONTROL_FORMAT_C4_B8_HSW		(7<<2)
+#define OACONTROL_FORMAT_SHIFT			2
+#define OACONTROL_PER_CTX_ENABLE		(1<<1)
+#define OACONTROL_ENABLE			(1<<0)
+
+#define OASTARTTRIG5				0x02720
+#define OASTARTTRIG5_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG6				0x02724
+#define OASTARTTRIG6_B4_TO_B7_THRESHOLD_ENABLE	(1<<23)
+#define OASTARTTRIG6_B4_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OASTARTTRIG1				0x02710
+#define OASTARTTRIG1_THRESHOLD_VALUE_MASK	0xffff
+
+#define OASTARTTRIG2				0x02714
+#define OASTARTTRIG2_B0_TO_B3_THRESHOLD_ENABLE	(1<<23)
+#define OASTARTTRIG2_B0_CUSTOM_EVENT_ENABLE	(1<<28)
+
+#define OACEC0_0				0x2770
+#define OACEC0_0_B0_COMPARE_ANY_EQUAL		0
+#define OACEC0_0_B0_COMPARE_OR			0
+#define OACEC0_0_B0_COMPARE_GREATER_THAN	1
+#define OACEC0_0_B0_COMPARE_EQUAL		2
+#define OACEC0_0_B0_COMPARE_GREATER_OR_EQUAL	3
+#define OACEC0_0_B0_COMPARE_LESS_THAN		4
+#define OACEC0_0_B0_COMPARE_NOT_EQUAL		5
+#define OACEC0_0_B0_COMPARE_LESS_OR_EQUAL	6
+#define OACEC0_0_B0_COMPARE_VALUE_MASK		0xffff
+#define OACEC0_0_B0_COMPARE_VALUE_SHIFT		3
+
+#define OACEC0_1				0x2774
+#define OACEC0_1_B0_NOA_SELECT_MASK		0xffff
+
+#define GEN7_OABUFFER				0x23B0 /* R/W */
+#define GEN7_OABUFFER_OVERRUN_DISABLE		(1<<3)
+#define GEN7_OABUFFER_EDGE_TRIGGER		(1<<2)
+#define GEN7_OABUFFER_STOP_RESUME_ENABLE	(1<<1)
+#define GEN7_OABUFFER_RESUME			(1<<0)
+
+#define GEN8_OABUFFER				0x2B14 /* R/W */
+#define GEN8_OABUFFER_SIZE_MASK			0x7
+#define GEN8_OABUFFER_SIZE_128K			(0<<3)
+#define GEN8_OABUFFER_SIZE_256K			(1<<3)
+#define GEN8_OABUFFER_SIZE_512K			(2<<3)
+#define GEN8_OABUFFER_SIZE_1M			(3<<3)
+#define GEN8_OABUFFER_SIZE_2M			(4<<3)
+#define GEN8_OABUFFER_SIZE_4M			(5<<3)
+#define GEN8_OABUFFER_SIZE_8M			(6<<3)
+#define GEN8_OABUFFER_SIZE_16M			(7<<3)
+#define GEN8_OABUFFER_EDGE_TRIGGER		(1<<2)
+#define GEN8_OABUFFER_OVERRUN_DISABLE		(1<<1)
+#define GEN8_OABUFFER_MEM_SELECT_GGTT		(1<<0)
+
+#define OASTATUS1				0x2364
+#define OASTATUS1_TAIL_MASK			0xffffffc0
+#define OASTATUS1_OABUFFER_SIZE_128K		(0<<3)
+#define OASTATUS1_OABUFFER_SIZE_256K		(1<<3)
+#define OASTATUS1_OABUFFER_SIZE_512K		(2<<3)
+#define OASTATUS1_OABUFFER_SIZE_1M		(3<<3)
+#define OASTATUS1_OABUFFER_SIZE_2M		(4<<3)
+#define OASTATUS1_OABUFFER_SIZE_4M		(5<<3)
+#define OASTATUS1_OABUFFER_SIZE_8M		(6<<3)
+#define OASTATUS1_OABUFFER_SIZE_16M		(7<<3)
+#define OASTATUS1_COUNTER_OVERFLOW		(1<<2)
+#define OASTATUS1_OABUFFER_OVERFLOW		(1<<1)
+#define OASTATUS1_REPORT_LOST			(1<<0)
+
+#define OASTATUS2				0x2368
+#define OASTATUS2_HEAD_MASK			0xffffffc0
+#define OASTATUS2_GGTT				0x1
+
+#define GEN8_OAHEADPTR				0x2B0C
+#define GEN8_OATAILPTR				0x2B10
 
 #define _GEN7_PIPEA_DE_LOAD_SL	0x70068
 #define _GEN7_PIPEB_DE_LOAD_SL	0x71068
@@ -5551,6 +5637,7 @@ enum punit_power_well {
 # define GEN6_RCCUNIT_CLOCK_GATE_DISABLE	(1 << 11)
 
 #define GEN6_UCGCTL3				0x9408
+# define GEN6_OACSUNIT_CLOCK_GATE_DISABLE	(1 << 20)
 
 #define GEN7_UCGCTL4				0x940c
 #define GEN7_L3BANK2X_CLOCK_GATE_DISABLE	(1<<25)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index ff57f07..fd3b0cb 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -58,6 +58,27 @@
 #define I915_ERROR_UEVENT		"ERROR"
 #define I915_RESET_UEVENT		"RESET"
 
+/**
+ * DOC: perf events configuration exposed by i915 through
+ * /sys/bus/event_sources/drivers/i915_oa
+ */
+#define I915_PERF_OA_CTX_ID_MASK		0xffffffff
+#define I915_PERF_OA_SINGLE_CONTEXT_ENABLE	(1ULL << 32)
+
+#define I915_PERF_OA_FORMAT_SHIFT		33
+#define I915_PERF_OA_FORMAT_MASK		(0x7ULL << 33)
+#define I915_PERF_OA_FORMAT_A13_HSW		(0ULL << 33)
+#define I915_PERF_OA_FORMAT_A29_HSW		(1ULL << 33)
+#define I915_PERF_OA_FORMAT_A13_B8_C8_HSW	(2ULL << 33)
+#define I915_PERF_OA_FORMAT_A29_B8_C8_HSW	(3ULL << 33)
+#define I915_PERF_OA_FORMAT_B4_C8_HSW		(4ULL << 33)
+#define I915_PERF_OA_FORMAT_A45_B8_C8_HSW	(5ULL << 33)
+#define I915_PERF_OA_FORMAT_B4_C8_A16_HSW	(6ULL << 33)
+#define I915_PERF_OA_FORMAT_C4_B8_HSW		(7ULL << 33)
+
+#define I915_PERF_OA_TIMER_EXPONENT_SHIFT	36
+#define I915_PERF_OA_TIMER_EXPONENT_MASK	(0x3fULL << 36)
+
 /* Each region is a minimum of 16k, and there are at most 255 of them.
  */
 #define I915_NR_TEX_REGIONS 255 /* table size 2k - maximum due to use
-- 
2.1.2
* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
From: Chris Wilson @ 2014-10-23 7:47 UTC
To: Robert Bragg
Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark, Samuel Pitoiset,
    Ben Skeggs

On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
> +	/* XXX: Not sure that this is really acceptable...
> +	 *
> +	 * i915_gem_context.c currently owns pinning/unpinning legacy
> +	 * context buffers and although that code has a
> +	 * get_context_alignment() func to handle a different
> +	 * constraint for gen6 we are assuming it's fixed for gen7
> +	 * here. Another option besides pinning here would be to
> +	 * instead hook into context switching and update the
> +	 * OACONTROL configuration on the fly.
> +	 */
> +	if (dev_priv->oa_pmu.specific_ctx) {
> +		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
> +		int ret;
> +
> +		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
> +					    4096, 0);

Right if you pin it here with a different alignment, when we try to pin
it with the required hw ctx alignment it will fail. Easiest way is to
record the ctx->legacy_hw_ctx.alignment and reuse that here.

> +		if (ret) {
> +			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
> +			ret = -EBUSY;

As an exercise, think of all the possible error values from pin() and
tell me why overriding that here is a bad, bad idea.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
From: Robert Bragg @ 2014-10-24 2:33 UTC
To: Chris Wilson
Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark, Samuel Pitoiset,
    Ben Skeggs

On Thu, Oct 23, 2014 at 8:47 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
>> +	/* XXX: Not sure that this is really acceptable...
>> +	 *
>> +	 * i915_gem_context.c currently owns pinning/unpinning legacy
>> +	 * context buffers and although that code has a
>> +	 * get_context_alignment() func to handle a different
>> +	 * constraint for gen6 we are assuming it's fixed for gen7
>> +	 * here. Another option besides pinning here would be to
>> +	 * instead hook into context switching and update the
>> +	 * OACONTROL configuration on the fly.
>> +	 */
>> +	if (dev_priv->oa_pmu.specific_ctx) {
>> +		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
>> +		int ret;
>> +
>> +		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
>> +					    4096, 0);
>
> Right if you pin it here with a different alignment, when we try to pin
> it with the required hw ctx alignment it will fail. Easiest way is to
> record the ctx->legacy_hw_ctx.alignment and reuse that here.

Ok I can look into that a bit more. I'm not currently sure I can assume
the ctx will have been pinned before, to be able to record the
alignment. Skimming i915_gem_context.c, it looks like we only pin the
default context on creation and a user could open a perf event before
we first switch to that context.

I wonder if it would be ok to expose an i915_get_context_alignment()
api to deal with this?

>> +		if (ret) {
>> +			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
>> +			ret = -EBUSY;
>
> As an exercise, think of all the possible error values from pin() and
> tell me why overriding that here is a bad, bad idea.

Hmm, I'm not quite sure why I decided to squash the error code there;
it looks pretty arbitrary. My take on your comment a.t.m is essentially
that some of the pin() errors don't really represent a busy state where
it would make sense for userspace to try again later, such as -ENODEV.
Sorry if you saw a very specific case that offended you :-) I have
removed the override locally.

Thanks for taking a look.
- Robert

> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
* Re: [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture
From: Chris Wilson @ 2014-10-24 6:56 UTC
To: Robert Bragg
Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Daniel Vetter, Rob Clark, Samuel Pitoiset,
    Ben Skeggs

On Fri, Oct 24, 2014 at 03:33:14AM +0100, Robert Bragg wrote:
> On Thu, Oct 23, 2014 at 8:47 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Wed, Oct 22, 2014 at 04:28:51PM +0100, Robert Bragg wrote:
> >> +	/* XXX: Not sure that this is really acceptable...
> >> +	 *
> >> +	 * i915_gem_context.c currently owns pinning/unpinning legacy
> >> +	 * context buffers and although that code has a
> >> +	 * get_context_alignment() func to handle a different
> >> +	 * constraint for gen6 we are assuming it's fixed for gen7
> >> +	 * here. Another option besides pinning here would be to
> >> +	 * instead hook into context switching and update the
> >> +	 * OACONTROL configuration on the fly.
> >> +	 */
> >> +	if (dev_priv->oa_pmu.specific_ctx) {
> >> +		struct intel_context *ctx = dev_priv->oa_pmu.specific_ctx;
> >> +		int ret;
> >> +
> >> +		ret = i915_gem_obj_ggtt_pin(ctx->legacy_hw_ctx.rcs_state,
> >> +					    4096, 0);
> >
> > Right if you pin it here with a different alignment, when we try to pin
> > it with the required hw ctx alignment it will fail. Easiest way is to
> > record the ctx->legacy_hw_ctx.alignment and reuse that here.
>
> Ok I can look into that a bit more. I'm not currently sure I can assume
> the ctx will have been pinned before, to be able to record the
> alignment. Skimming i915_gem_context.c, it looks like we only pin the
> default context on creation and a user could open a perf event before
> we first switch to that context.
>
> I wonder if it would be ok to expose an i915_get_context_alignment()
> api to deal with this?

I would either add intel_context_pin_state()/unpin_state() or expose
ctx->...state_alignment. Leaning towards the former so that we don't
have too many places mucking around inside ctx.

> >> +		if (ret) {
> >> +			DRM_DEBUG_DRIVER("Couldn't pin %d\n", ret);
> >> +			ret = -EBUSY;
> >
> > As an exercise, think of all the possible error values from pin() and
> > tell me why overriding that here is a bad, bad idea.
>
> Hmm, I'm not quite sure why I decided to squash the error code there;
> it looks pretty arbitrary. My take on your comment a.t.m is essentially
> that some of the pin() errors don't really represent a busy state where
> it would make sense for userspace to try again later, such as -ENODEV.
> Sorry if you saw a very specific case that offended you :-) I have
> removed the override locally.

Or EINTR/EAGAIN and try again immediately. ;)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Ingo Molnar @ 2014-10-23 5:58 UTC
To: Robert Bragg
Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
    Samuel Pitoiset, Ben Skeggs

* Robert Bragg <robert@sixbynine.org> wrote:

> [...]
>
> I'd be interested to hear whether is sounds reasonable to
> others for us to expose gpu device metrics via a perf pmu and
> whether adding the PERF_PMU_CAP_IS_DEVICE flag as in my
> following patch could be acceptable.

I think it's perfectly reasonable, it's one of the interesting
kernel features I hoped for years would be implemented for perf.

> [...]
>
> In addition I also explicitly black list numerous attributes
> and PERF_SAMPLE_ flags that I don't think make sense for a
> device pmu. This could be handled in the pmu driver but it
> seemed better to do in events/core, avoiding duplication in
> case we later have multiple device pmus.

Btw., if the GPU is able to dump (part of) its current execution
status, you could in theory even do instruction profiling and
in-GPU profiling (symbol resolution, maybe even annotation, etc.)
with close to standard perf tooling - which I think is currently
mostly the domain of proprietary tools.

Thanks,

	Ingo
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Robert Bragg @ 2014-10-24 13:39 UTC
To: Ingo Molnar
Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark,
    Samuel Pitoiset, Ben Skeggs

On Thu, Oct 23, 2014 at 6:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Robert Bragg <robert@sixbynine.org> wrote:
>
>> [...]
>>
>> I'd be interested to hear whether is sounds reasonable to
>> others for us to expose gpu device metrics via a perf pmu and
>> whether adding the PERF_PMU_CAP_IS_DEVICE flag as in my
>> following patch could be acceptable.
>
> I think it's perfectly reasonable, it's one of the interesting
> kernel features I hoped for years would be implemented for perf.

Ok, that's good to hear, thanks.

>> [...]
>>
>> In addition I also explicitly black list numerous attributes
>> and PERF_SAMPLE_ flags that I don't think make sense for a
>> device pmu. This could be handled in the pmu driver but it
>> seemed better to do in events/core, avoiding duplication in
>> case we later have multiple device pmus.
>
> Btw., if the GPU is able to dump (part of) its current execution
> status, you could in theory even do instruction profiling and
> in-GPU profiling (symbol resolution, maybe even annotation, etc.)
> with close to standard perf tooling - which I think is currently
> mostly the domain of proprietary tools.

I'm not entirely sure, but there are certainly quite a few hw debug
features I haven't really explored yet that might lend themselves to
supporting something like this.

At least there's a breakpoint mechanism that looks like it might help
for something like this, including providing a way to focus on specific
threads of interest, which would be pretty important here to reduce how
much state would have to be periodically captured. I could be interested
to experiment with this later for sure.

With respect to the OA counters I'm looking to expose first, I would
note that the counter snapshots being written periodically are written
by a fixed function unit where we only have a limited set of layouts
that we can choose. Beyond the OA counters I can certainly see us
wanting further pmus for other metrics though.

For reference Chris Wilson experimented with some related ideas last
year, here:

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=perf&id=f32c19f4fb3a3bda92ab7bd0b9f95da14e81ca0a

Regards,
- Robert

> Thanks,
>
> Ingo
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Peter Zijlstra @ 2014-10-30 19:08 UTC
To: Robert Bragg
Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
    Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs

On Wed, Oct 22, 2014 at 04:28:48PM +0100, Robert Bragg wrote:
> Our desired permission model seems consistent with perf's current model
> whereby you would need privileges if you want to profile across all gpu
> contexts but not need special permissions to profile your own context.
>
> The awkward part is that it doesn't make sense for us to have userspace
> open a perf event with a specific pid as the way to avoid needing root
> permissions because a side effect of doing this is that the events will
> be dynamically added/deleted so as to only monitor while that process is
> scheduled and that's not really meaningful when we're monitoring the
> gpu.

There is precedent in PERF_FLAG_PID_CGROUP to replace the pid argument
with a fd to your object.

And do I take it right that if you're able/allowed/etc.. to open/have
the fd to the GPU/DRM/DRI whatever context you have the right
credentials to also observe these counters?

> Conceptually I suppose we want to be able to open an event that's not
> associated with any cpu or process, but to keep things simple and fit
> with perf's current design, the pmu I have a.t.m expects an event to be
> opened for a specific cpu and unspecified process.

There are no actual scheduling ramifications right? Let me ponder this
for a little while more..
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Robert Bragg @ 2014-11-03 21:47 UTC
To: Peter Zijlstra
Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
    Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs

On Thu, Oct 30, 2014 at 7:08 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 22, 2014 at 04:28:48PM +0100, Robert Bragg wrote:
>> Our desired permission model seems consistent with perf's current model
>> whereby you would need privileges if you want to profile across all gpu
>> contexts but not need special permissions to profile your own context.
>>
>> The awkward part is that it doesn't make sense for us to have userspace
>> open a perf event with a specific pid as the way to avoid needing root
>> permissions because a side effect of doing this is that the events will
>> be dynamically added/deleted so as to only monitor while that process is
>> scheduled and that's not really meaningful when we're monitoring the
>> gpu.
>
> There is precedent in PERF_FLAG_PID_CGROUP to replace the pid argument
> with a fd to your object.

Ah ok, interesting.

> And do I take it right that if you're able/allowed/etc.. to open/have
> the fd to the GPU/DRM/DRI whatever context you have the right
> credentials to also observe these counters?

Right and in particular since we want to allow OpenGL clients to be
able to profile their own gpu context without any special privileges my
current pmu driver accepts a device file descriptor via config1 + a
context id via attr->config, both for checking credentials and uniquely
identifying which context should be profiled. (A single client can open
multiple contexts via one drm fd)

That said though; when running as root it is not currently a
requirement to pass any fd when configuring an event to profile across
all gpu contexts. I'm just mentioning this because although I think it
should be ok for us to use an fd to determine credentials and help
specify a gpu context, an fd might not be necessary for system wide
profiling cases.

>> Conceptually I suppose we want to be able to open an event that's not
>> associated with any cpu or process, but to keep things simple and fit
>> with perf's current design, the pmu I have a.t.m expects an event to be
>> opened for a specific cpu and unspecified process.
>
> There are no actual scheduling ramifications right? Let me ponder this
> for a little while more..

Ok, I can't say I'm familiar enough with the core perf infrastructure
to be entirely sure about this.

I recall looking at how some of the uncore perf drivers were working
and it looked like they had a similar issue where conceptually the pmu
doesn't belong to a specific cpu and so the id would internally get
mapped to some package state, shared by multiple cpus.

My understanding had been that being associated with a specific cpu
did have the side effect that most of the pmu methods for that event
would then be invoked on that cpu through inter-processor interrupts.
At one point that had seemed slightly problematic because there weren't
many places within my pmu driver where I could assume I was in process
context and could sleep. This was a problem with an earlier version
because the way I read registers had a slim chance of needing to sleep
waiting for the gpu to come out of RC6, but isn't a problem any more.

One thing that does come to mind here though is that I am overloading
pmu->read() as a mechanism for userspace to trigger a flush of all
counter snapshots currently in the gpu circular buffer to userspace as
perf events. Perhaps it would be best if that work (which might be
relatively costly at times) were done in the context of the process
issuing the flush(), instead of under an IPI (assuming that has some
effect on scheduler accounting).

Regards,
- Robert
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Peter Zijlstra @ 2014-11-05 12:33 UTC
To: Robert Bragg
Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
    Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs

On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote:
>> And do I take it right that if you're able/allowed/etc.. to open/have
>> the fd to the GPU/DRM/DRI whatever context you have the right
>> credentials to also observe these counters?
>
> Right and in particular since we want to allow OpenGL clients to be
> able to profile their own gpu context without any special privileges
> my current pmu driver accepts a device file descriptor via config1 + a
> context id via attr->config, both for checking credentials and
> uniquely identifying which context should be profiled. (A single
> client can open multiple contexts via one drm fd)

Ah interesting. So we've got fd+context_id+event_id to identify any one
number provided by the GPU.

> That said though; when running as root it is not currently a
> requirement to pass any fd when configuring an event to profile across
> all gpu contexts. I'm just mentioning this because although I think it
> should be ok for us to use an fd to determine credentials and help
> specify a gpu context, an fd might not be necessary for system wide
> profiling cases.

Hmm, how does root know what context_id to provide? Are those exposed
somewhere? Is there also a root context, one that encompasses all
others?

>> >> Conceptually I suppose we want to be able to open an event that's not
>> >> associated with any cpu or process, but to keep things simple and fit
>> >> with perf's current design, the pmu I have a.t.m expects an event to be
>> >> opened for a specific cpu and unspecified process.
>> >
>> > There are no actual scheduling ramifications right? Let me ponder this
>> > for a little while more..
>>
>> Ok, I can't say I'm familiar enough with the core perf infrastructure
>> to be entirely sure about this.

Yeah, so I don't think so. It's on the device, nothing the CPU/scheduler
does affects what the device does.

> I recall looking at how some of the uncore perf drivers were working
> and it looked like they had a similar issue where conceptually the pmu
> doesn't belong to a specific cpu and so the id would internally get
> mapped to some package state, shared by multiple cpus.

Yeah, we could try and map these devices to a cpu on their node -- PCI
devices are node local. But I'm not sure we need to start out by doing
that.

> My understanding had been that being associated with a specific cpu
> did have the side effect that most of the pmu methods for that event
> would then be invoked on that cpu through inter-processor interrupts.
> At one point that had seemed slightly problematic because there weren't
> many places within my pmu driver where I could assume I was in process
> context and could sleep. This was a problem with an earlier version
> because the way I read registers had a slim chance of needing to sleep
> waiting for the gpu to come out of RC6, but isn't a problem any more.

Right, so I suppose we could make a new global context for these device
like things and avoid some of that song and dance. But we can do that
later.

> One thing that does come to mind here though is that I am overloading
> pmu->read() as a mechanism for userspace to trigger a flush of all
> counter snapshots currently in the gpu circular buffer to userspace as
> perf events. Perhaps it would be best if that work (which might be
> relatively costly at times) were done in the context of the process
> issuing the flush(), instead of under an IPI (assuming that has some
> effect on scheduler accounting).

Right, so given you tell the GPU to periodically dump these stats (per
context I presume), you can at a similar interval schedule whatever to
flush this and update the relevant event->count values and have a NO-OP
pmu::read() method.

If the GPU provides interrupts to notify you of new data or whatnot, you
can make that drive the thing.
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver
From: Robert Bragg @ 2014-11-06 0:37 UTC
To: Peter Zijlstra
Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
    Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs

On Wed, Nov 5, 2014 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote:
>
>> > And do I take it right that if you're able/allowed/etc.. to open/have
>> > the fd to the GPU/DRM/DRI whatever context you have the right
>> > credentials to also observe these counters?
>>
>> Right and in particular since we want to allow OpenGL clients to be
>> able to profile their own gpu context without any special privileges
>> my current pmu driver accepts a device file descriptor via config1 + a
>> context id via attr->config, both for checking credentials and
>> uniquely identifying which context should be profiled. (A single
>> client can open multiple contexts via one drm fd)
>
> Ah interesting. So we've got fd+context_id+event_id to identify any one
> number provided by the GPU.

Roughly. The fd represents the device we're interested in. Since a
single application can manage multiple unique gpu contexts for
submitting work we have the context_id to identify which one in
particular we want to collect metrics for.

The event_id here though really represents a set of counters that are
written out together in a hardware specific report layout. On Haswell
there are 8 different report layouts that basically trade off how many
counters to include, from 13 to 61 32-bit counters plus 1 64-bit
timestamp. I exposed this format choice in the event configuration.
It's notable that all of the counter values written in one report are
captured atomically with respect to the gpu clock.
Within the reports most of the counters are hard-wired and they are
referred to as Aggregating counters, including things like:

* number of cycles the render engine was busy for
* number of cycles the gpu was active
* number of cycles the gpu was stalled
  (I'll just gloss over what distinguishes each of these states)
* number of active cycles spent running a vertex shader
* number of stalled cycles spent running a vertex shader
* number of vertex shader threads spawned
* number of active cycles spent running a pixel shader
* number of stalled cycles spent running a pixel shader
* number of pixel shader threads spawned
...

The values are aggregated across all of the gpu's execution units (e.g.
up to 40 units on Haswell).

Besides these aggregating counters the reports also include a gpu clock
counter which allows us to normalize these values into something more
intuitive for profiling.

There is a further small set of counters referred to as B counters in
the public PRMs that are also included in these reports. The hardware
has some configurability for these counters, but given the constraints
on configuring them the expectation would be to just allow userspace to
specify an enum for certain pre-defined configurations. (E.g. a
configuration that exposes a well defined set of B counters useful for
OpenGL profiling vs GPGPU profiling)

I had considered uniquely identifying each of the A counters with
separate perf event ids, but I think the main reasons I decided against
that in the end are:

Since they are written atomically the counters in a snapshot are all
related and the analysis to derive useful values for benchmarking
typically needs to refer to multiple counters in a single snapshot at a
time. E.g. to report the "Average cycles per vertex shader thread" we'd
need to divide the number of cycles spent running a vertex shader by
the number of vertex shader threads spawned. If we split the counters
up we'd then need to do work to correlate them again in userspace.
My other concern was actually with memory bandwidth, considering that it's possible to request the gpu to write out periodic snapshots at a very high frequency (we can program a period as low as 160 nanoseconds) and pushing this to the limit (running as root + overriding perf_event_max_sample_rate) can start to expose some interesting details about how the gpu is working - though notable observer effects too. I was expecting memory bandwidth to be the limiting factor for what resolution we can achieve this way and splitting the counters up looked like it would have quite a big impact, due to the extra sample headers and that the gpu timestamp would need to be repeated with each counter. e.g. in the most extreme case, instead of 8byte header + 61 counters * 4 bytes + 8byte timestamp every 160ns ~= 1.6GB/s, each counter would need to be paired with a gpu timestamp + header so we could have 61 * (8 + 4 + 8)bytes ~= 7.6GB/s. To be fair though it's likely that if the counters were split up we probably wouldn't often need a full set of 61 counters. One last thing to mention here is that this first pmu driver that I have written only relates to one very specific observation unit within the gpu that happens to expose counters via reports/snapshots. There are other interesting gpu counters I could imagine exposing through separate pmu drivers too where the counters might simply be accessed via mmio and for those cases I would imagine having a 1:1 mapping between event-ids and counters. > >> That said though; when running as root it is not currently a >> requirement to pass any fd when configuring an event to profile across >> all gpu contexts. I'm just mentioning this because although I think it >> should be ok for us to use an fd to determine credentials and help >> specify a gpu context, an fd might not be necessary for system wide >> profiling cases. > > Hmm, how does root know what context_id to provide? Are those exposed > somewhere? 
> Is there also a root context, one that encompasses all
> others?

No, it's just that the observation unit has two modes of operation; either we can ask the unit to only aggregate counters for a specific context_id or tell it to aggregate across all contexts.

>
>> >> Conceptually I suppose we want to be able to open an event that's not
>> >> associated with any cpu or process, but to keep things simple and fit
>> >> with perf's current design, the pmu I have a.t.m expects an event to be
>> >> opened for a specific cpu and unspecified process.
>> >
>> > There are no actual scheduling ramifications right? Let me ponder this
>> > for a little while more..
>>
>> Ok, I can't say I'm familiar enough with the core perf infrastructure
>> to be entirely sure about this.
>
> Yeah, so I don't think so. It's on the device, nothing the CPU/scheduler
> does affects what the device does.
>
>> I recall looking at how some of the uncore perf drivers were working
>> and it looked like they had a similar issue where conceptually the pmu
>> doesn't belong to a specific cpu and so the id would internally get
>> mapped to some package state, shared by multiple cpus.
>
> Yeah, we could try and map these devices to a cpu on their node -- PCI
> devices are node local. But I'm not sure we need to start out by doing
> that.
>
>> My understanding had been that being associated with a specific cpu
>> did have the side effect that most of the pmu methods for that event
>> would then be invoked on that cpu through inter-processor interrupts.
>> At one point that had seemed slightly problematic because there weren't
>> many places within my pmu driver where I could assume I was in process
>> context and could sleep. This was a problem with an earlier version
>> because the way I read registers had a slim chance of needing to sleep
>> waiting for the gpu to come out of RC6, but isn't a problem any more.
>
> Right, so I suppose we could make a new global context for these device
> like things and avoid some of that song and dance. But we can do that
> later.

Sure, at least for now it seems workable.

>
>> One thing that does come to mind here though is that I am overloading
>> pmu->read() as a mechanism for userspace to trigger a flush of all
>> counter snapshots currently in the gpu circular buffer to userspace as
>> perf events. Perhaps it would be best if that work (which might be
>> relatively costly at times) were done in the context of the process
>> issuing the flush(), instead of under an IPI (assuming that has some
>> effect on scheduler accounting).
>
> Right, so given you tell the GPU to periodically dump these stats (per
> context I presume), you can at a similar interval schedule whatever to
> flush this and update the relevant event->count values and have a NO-OP
> pmu::read() method.
>
> If the GPU provides interrupts to notify you of new data or whatnot, you
> can make that drive the thing.

Right, I'm already ensuring the events will be forwarded within a finite time using a hrtimer, currently at 200Hz, but there are also times where userspace wants to pull from the driver too.

The use case here is supporting the INTEL_performance_query OpenGL extension, where an application can submit work to render on the gpu and can also start and stop performance queries around specific work and then ask for the results. Given how the queries are delimited, Mesa can determine when the work being queried has completed and at that point the application can request the results of the query.

In this model Mesa will have configured a perf event to deliver periodic counter snapshots, but it only really cares about snapshots that fall between the start and end of a query. For this use case the periodic snapshots are just to detect counters wrapping, so the frequency will be relatively low, with a period of ~50 milliseconds.
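As an aside on why low-frequency snapshots suffice for wrap detection: so long as two consecutive 32-bit snapshots are taken less than one full wrap apart, modulo-2^32 subtraction recovers the true delta. A minimal sketch (the helper name is mine, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

/* Accumulate a 32-bit hardware counter into a 64-bit total.  Unsigned
 * subtraction yields the correct delta even if the raw counter wrapped
 * once between snapshots -- which is all the ~50ms periodic snapshots
 * need to guarantee. */
static uint64_t accumulate_wrapping(uint64_t total, uint32_t prev, uint32_t cur)
{
	return total + (uint32_t)(cur - prev);
}
```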
At the end of a query Mesa won't know whether any periodic snapshots fell between the start and end, so it wants to explicitly flush at a point where it knows any snapshots will be ready if there are any.

Alternatively I think I could arrange it so that Mesa relies on knowing the driver will forward snapshots @ 200Hz and we could delay informing the application that results are ready until we are certain they must have been forwarded. I think the api could allow us to do that (except for one awkward case where the application can demand a synchronous response, where we'd potentially have to sleep). My concern here is having to rely on a fixed and relatively high frequency for forwarding events, which seems like it should be left as an implementation detail that userspace shouldn't need to know. I'm guessing it could also be good at some point for the hrtimer frequency to be derived from the buffer size + report sizes + snapshot frequency instead of being fixed, but this could be difficult to change if userspace needs to make assumptions about it; it could also increase the time userspace would have to wait before it could be sure outstanding snapshots have been received.

Hopefully that explains why I'm overloading read() like this currently.

Regards
- Robert
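The alternative scheme Robert describes above - relying on a known forwarding frequency instead of an explicit flush - would amount to something like the following sketch. The function and its name are hypothetical; the 200Hz figure is just the driver's current hrtimer rate:

```c
#include <assert.h>
#include <stdint.h>

/* If snapshots are guaranteed to be forwarded to the perf buffer at
 * forward_hz, results for a query that ended at t_end_ns can safely be
 * reported once one full forwarding period has elapsed, since any
 * snapshot taken before the query end must have been forwarded by then. */
static int query_results_ready(uint64_t now_ns, uint64_t t_end_ns,
			       unsigned int forward_hz)
{
	uint64_t period_ns = 1000000000ull / forward_hz;

	return now_ns >= t_end_ns + period_ns;
}
```

At 200Hz the worst-case wait is 5ms, which is exactly the kind of userspace-visible assumption the discussion argues against baking in.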
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver 2014-11-06 0:37 ` Robert Bragg @ 2014-11-10 11:13 ` Ingo Molnar 2014-11-12 23:33 ` Robert Bragg 0 siblings, 1 reply; 16+ messages in thread From: Ingo Molnar @ 2014-11-10 11:13 UTC (permalink / raw) To: Robert Bragg Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs * Robert Bragg <robert@sixbynine.org> wrote: > On Wed, Nov 5, 2014 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Nov 03, 2014 at 09:47:17PM +0000, Robert Bragg wrote: > > > >> > And do I take it right that if you're able/allowed/etc.. to open/have > >> > the fd to the GPU/DRM/DRI whatever context you have the right > >> > credentials to also observe these counters? > >> > >> Right and in particular since we want to allow OpenGL clients to be > >> able the profile their own gpu context with out any special privileges > >> my current pmu driver accepts a device file descriptor via config1 + a > >> context id via attr->config, both for checking credentials and > >> uniquely identifying which context should be profiled. (A single > >> client can open multiple contexts via one drm fd) > > > > Ah interesting. So we've got fd+context_id+event_id to identify any one > > number provided by the GPU. > > Roughly. > > The fd represents the device we're interested in. > > Since a single application can manage multiple unique gpu contexts for > submitting work we have the context_id to identify which one in > particular we want to collect metrics for. > > The event_id here though really represents a set of counters that are > written out together in a hardware specific report layout. > > On Haswell there are 8 different report layouts that basically trade > off how many counters to include from 13 to 61 32bit counters plus 1 > 64bit timestamp. I exposed this format choice in the event > configuration. 
It's notable that all of the counter values written in > one report are captured atomically with respect to the gpu clock. > > Within the reports most of the counters are hard-wired and they are > referred to as Aggregating counters, including things like: > > * number of cycles the render engine was busy for > * number of cycles the gpu was active > * number of cycles the gpu was stalled > (i'll just gloss over what distinguishes each of these states) > * number of active cycles spent running a vertex shader > * number of stalled cycles spent running a vertex shader > * number of vertex shader threads spawned > * number of active cycles spent running a pixel shader > * number of stalled cycles spent running a pixel shader" > * number of pixel shader threads spawned > ... Just curious: Beyond aggregated counts, do the GPU reports also allow sampling the PC of the vertex shader and pixel shader execution? That would allow effective annotated disassembly of them and bottleneck analysis - much like 'perf annotate' and how you can drill into annotated assembly code in 'perf report' and 'perf top'. Secondly, do you also have cache hit/miss counters (with sampling ability) for the various caches the GPU utilizes: such as the LLC it shares with the CPU, or GPU-specific caches (if any) such as the vertex cache? Most GPU shader performance problems relate to memory access patterns and the above aggregate counts only tell us the global picture. Thirdly, if taken branch instructions block/stall non-taken threads within an execution unit (like it happens on other vector CPUs) then being able to measure/sample current effective thread concurrency within an execution unit is generally useful as well, to be able to analyze this major class of GPU/GPGPU performance problems. > The values are aggregated across all of the gpu's execution > units (e.g. 
up to 40 units on Haswell) > > Besides these aggregating counters the reports also include a > gpu clock counter which allows us to normalize these values > into something more intuitive for profiling. Modern GPUs can also change their clock frequency depending on load - is the GPU clock normalized by the hardware to a known fixed frequency, or does it change as the GPU's clock changes? > [...] > > I had considered uniquely identifying each of the A counters > with separate perf event ids, but I think the main reasons I > decided against that in the end are: > > Since they are written atomically the counters in a snapshot > are all related and the analysis to derive useful values for > benchmarking typically needs to refer to multiple counters in a > single snapshot at a time. E.g. to report the "Average cycles > per vertex shader thread" would need to measure the number of > cycles spent running a vertex shader / the number of vertex > shader threads spawned. If we split the counters up we'd then > need to do work to correlate them again in userspace. > > My other concern was actually with memory bandwidth, > considering that it's possible to request the gpu to write out > periodic snapshots at a very high frequency (we can program a > period as low as 160 nanoseconds) and pushing this to the limit > (running as root + overriding perf_event_max_sample_rate) can > start to expose some interesting details about how the gpu is > working - though notable observer effects too. I was expecting > memory bandwidth to be the limiting factor for what resolution > we can achieve this way and splitting the counters up looked > like it would have quite a big impact, due to the extra sample > headers and that the gpu timestamp would need to be repeated > with each counter. e.g. 
in the most extreme case, instead of > 8byte header + 61 counters * 4 bytes + 8byte timestamp every > 160ns ~= 1.6GB/s, each counter would need to be paired with a > gpu timestamp + header so we could have 61 * (8 + 4 + 8)bytes > ~= 7.6GB/s. To be fair though it's likely that if the counters > were split up we probably wouldn't often need a full set of 61 > counters. If you really want to collect such high frequency data then you are probably right in trying to compress the report format as much as possible. > One last thing to mention here is that this first pmu driver > that I have written only relates to one very specific > observation unit within the gpu that happens to expose counters > via reports/snapshots. There are other interesting gpu counters > I could imagine exposing through separate pmu drivers too where > the counters might simply be accessed via mmio and for those > cases I would imagine having a 1:1 mapping between event-ids > and counters. I'd strong suggest thinking about sampling as well, if the hardware exposes sample information: at least for profiling CPU loads the difference is like day and night, compared to aggregated counts and self-profiling. > > [...] > > > > If the GPU provides interrupts to notify you of new data or > > whatnot, you can make that drive the thing. > > Right, I'm already ensuring the events will be forwarded within > a finite time using a hrtimer, currently at 200Hz but there are > also times where userspace wants to pull at the driver too. > > The use case here is supporting the INTEL_performance_query > OpenGL extension, where an application which can submit work to > render on the gpu and can also start and stop performance > queries around specific work and then ask for the results. > Given how the queries are delimited Mesa can determine when the > work being queried has completed and at that point the > application can request the results of the query. 
> > In this model Mesa will have configured a perf event to deliver > periodic counter snapshots, but it only really cares about > snapshots that fall between the start and end of a query. For > this use case the periodic snapshots are just to detect > counters wrapping and so the period will be relatively low at > ~50milliseconds. At the end of a query Mesa won't know whether > there are any periodic snapshots that fell between the > start-end so it wants to explicitly flush at a point where it > knows any snapshots will be ready if there are any. > > Alternatively I think I could arrange it so that Mesa relies on > knowing the driver will forward snapshots @ 200Hz and we could > delay informing the application that results are ready until we > are certain they must have been forwarded. I think the api > could allow us to do that (except for one awkward case where > the application can demand a synchronous response where we'd > potentially have to sleep) My concern here is having to rely on > a fixed and relatively high frequency for forwarding events > which seems like it should be left as an implementation detail > that userspace shouldn't need to know. It's a very good idea to not expose such limitations to user-space - the GPU driver doing the necessary hrtimer polling to construct a proper count is a much higher quality solution. The last thing you want to ask yourself when seeing some weird profiling result is 'did user-space properly poll the PMU or did we overflow??'. Instrumentation needs to be rock solid dependable and fast, in that order. Thanks, Ingo ^ permalink raw reply [flat|nested] 16+ messages in thread
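As a recap of the event encoding discussed earlier in the thread (drm fd via config1, gpu context id via attr->config), here is a hedged userspace sketch of filling the attribute struct. The helper is illustrative, not Mesa's actual code, and in practice the dynamically allocated PMU type id would be read from sysfs:

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fill a perf_event_attr for the i915 OA pmu as described in the
 * thread: context id in config, drm fd in config1 (used both for
 * credentials and to identify the device), with counter snapshots
 * delivered as PERF_SAMPLE_RAW records. */
static void init_oa_event_attr(struct perf_event_attr *attr,
			       uint32_t pmu_type, int drm_fd,
			       uint32_t gpu_ctx_id)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = pmu_type;            /* dynamic pmu type id from sysfs */
	attr->config = gpu_ctx_id;        /* which gpu context to profile */
	attr->config1 = (uint64_t)drm_fd; /* credentials + device */
	attr->sample_type = PERF_SAMPLE_RAW;
}
```

The filled struct would then be passed to perf_event_open() for a specific cpu with pid = -1, per the permission-model discussion in the cover letter.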
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver 2014-11-10 11:13 ` Ingo Molnar @ 2014-11-12 23:33 ` Robert Bragg 2014-11-16 9:27 ` Ingo Molnar 0 siblings, 1 reply; 16+ messages in thread From: Robert Bragg @ 2014-11-12 23:33 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs On Mon, Nov 10, 2014 at 11:13 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Robert Bragg <robert@sixbynine.org> wrote: > <snip> >> On Haswell there are 8 different report layouts that basically trade >> off how many counters to include from 13 to 61 32bit counters plus 1 >> 64bit timestamp. I exposed this format choice in the event >> configuration. It's notable that all of the counter values written in >> one report are captured atomically with respect to the gpu clock. >> >> Within the reports most of the counters are hard-wired and they are >> referred to as Aggregating counters, including things like: >> >> * number of cycles the render engine was busy for >> * number of cycles the gpu was active >> * number of cycles the gpu was stalled >> (i'll just gloss over what distinguishes each of these states) >> * number of active cycles spent running a vertex shader >> * number of stalled cycles spent running a vertex shader >> * number of vertex shader threads spawned >> * number of active cycles spent running a pixel shader >> * number of stalled cycles spent running a pixel shader" >> * number of pixel shader threads spawned >> ... > > Just curious: > > Beyond aggregated counts, do the GPU reports also allow sampling > the PC of the vertex shader and pixel shader execution? > > That would allow effective annotated disassembly of them and > bottleneck analysis - much like 'perf annotate' and how you can > drill into annotated assembly code in 'perf report' and 'perf > top'. 
No, I'm afraid these particular counter reports from the OA unit can't give us access to EU instruction pointers or other EU registers, even considering the set of configurable counters that can be exposed besides the aggregate counters. These OA counters are more-or-less just boolean event counters.

Your train of thought got me wondering, though, whether it would be possible to sample instruction pointers of EU threads periodically, so I spent a bit of time investigating how it could potentially be implemented, out of curiosity. I found at least one possible approach, but one thing that became apparent is that it wouldn't really be possible to handle neatly from the kernel and would need tightly coupled support from Mesa in userspace too...

Gen EUs have some support for exception handling where an exception could be triggered periodically (not internally by the gpu, but rather by the cpu) and the EUs made to run a given 'system routine' which would be able to sample the instruction pointer of the interrupted threads. One of the difficulties is that it wouldn't be possible for the kernel to directly set up a system routine for profiling like this, since the pointer for the routine is set via a STATE_SIP command that requires a pointer relative to the 'instruction base pointer', which is state that's really owned and set up by userspace drivers. Incidentally, our current driver stack doesn't utilise system routines for anything, so at least something like this wouldn't conflict with an existing feature.

Some experiments were done with system routines by Ben Widawsky some years ago now, with the aim of using them for debugging as opposed to profiling, and that means he has some code knocking around (intel-gpu-tools/debugger) that could make it possible to put together an experiment for this.
For now I'd like to continue with enabling access to the OA counters via perf if possible, since that's much lower hanging fruit but should still allow a decent range of profiling tools. If I get a chance though, I'm tempted to see if I can use Ben's code as a basis to experiment with this idea. > > Secondly, do you also have cache hit/miss counters (with sampling > ability) for the various caches the GPU utilizes: such as the LLC > it shares with the CPU, or GPU-specific caches (if any) such as > the vertex cache? Most GPU shader performance problems relate to > memory access patterns and the above aggregate counts only tell > us the global picture. Right, we can expose some of these via OA counter reports, through the configurable counters. E.g. we can get a counter for the number of L3 cache read/write transactions via the LLC which can be converted into a throughput. There are also other interesting counters relating to the texture samplers for example, that are a common bottleneck. My initial i915_oa driver doesn't look at exposing those yet since we still need to work through an approval process for some of the details. My first interest was to start with creating a driver to expose the features and counters we already have published public docs for, which in turn let me send out this RFC sooner rather than later. > > Thirdly, if taken branch instructions block/stall non-taken > threads within an execution unit (like it happens on other vector > CPUs) then being able to measure/sample current effective thread > concurrency within an execution unit is generally useful as well, > to be able to analyze this major class of GPU/GPGPU performance > problems. Right, Gen EUs try to co-issue instructions from multiple threads at the same time, so long as they aren't contending for the same units. 
I'm not currently sure of a way to get insight into this for Haswell, but for Broadwell we gain some more aggregate EU counters (actually some of them become customisable) and then it's possible to count the issuing of instructions for some of the sub-units that allow co-issuing.

>
>> The values are aggregated across all of the gpu's execution
>> units (e.g. up to 40 units on Haswell)
>>
>> Besides these aggregating counters the reports also include a
>> gpu clock counter which allows us to normalize these values
>> into something more intuitive for profiling.
>
> Modern GPUs can also change their clock frequency depending on
> load - is the GPU clock normalized by the hardware to a known
> fixed frequency, or does it change as the GPU's clock changes?

Sadly on Haswell, while these OA counters are enabled we need to disable RC6 and also render trunk clock gating, so this obviously has an impact on profiling that needs to be taken into account. On Broadwell I think we should be able to enable both though, and in that case the gpu will automatically write additional counter snapshots when transitioning in and out of RC6 as well as when the clock frequency changes.

<snip>
>
>> One last thing to mention here is that this first pmu driver
>> that I have written only relates to one very specific
>> observation unit within the gpu that happens to expose counters
>> via reports/snapshots. There are other interesting gpu counters
>> I could imagine exposing through separate pmu drivers too where
>> the counters might simply be accessed via mmio and for those
>> cases I would imagine having a 1:1 mapping between event-ids
>> and counters.
>
> I'd strong suggest thinking about sampling as well, if the
> hardware exposes sample information: at least for profiling CPU
> loads the difference is like day and night, compared to
> aggregated counts and self-profiling.

Here I was thinking of counters or data that can be sampled via mmio using a hrtimer. E.g.
the current gpu frequency or the energy usage. I'm not currently aware of any capability for the gpu to say trigger an interrupt after a threshold number of events occurs (like clock cycles) so I think we may generally be limited to a wall clock time domain for sampling. As above, I'll also keep in mind, experimenting with being able to sample EU IPs at some point too. > >> > [...] >> > >> > If the GPU provides interrupts to notify you of new data or >> > whatnot, you can make that drive the thing. >> >> Right, I'm already ensuring the events will be forwarded within >> a finite time using a hrtimer, currently at 200Hz but there are >> also times where userspace wants to pull at the driver too. >> >> The use case here is supporting the INTEL_performance_query >> OpenGL extension, where an application which can submit work to >> render on the gpu and can also start and stop performance >> queries around specific work and then ask for the results. >> Given how the queries are delimited Mesa can determine when the >> work being queried has completed and at that point the >> application can request the results of the query. >> >> In this model Mesa will have configured a perf event to deliver >> periodic counter snapshots, but it only really cares about >> snapshots that fall between the start and end of a query. For >> this use case the periodic snapshots are just to detect >> counters wrapping and so the period will be relatively low at >> ~50milliseconds. At the end of a query Mesa won't know whether >> there are any periodic snapshots that fell between the >> start-end so it wants to explicitly flush at a point where it >> knows any snapshots will be ready if there are any. >> >> Alternatively I think I could arrange it so that Mesa relies on >> knowing the driver will forward snapshots @ 200Hz and we could >> delay informing the application that results are ready until we >> are certain they must have been forwarded. 
>> I think the api
>> could allow us to do that (except for one awkward case where
>> the application can demand a synchronous response where we'd
>> potentially have to sleep) My concern here is having to rely on
>> a fixed and relatively high frequency for forwarding events
>> which seems like it should be left as an implementation detail
>> that userspace shouldn't need to know.
>
> It's a very good idea to not expose such limitations to
> user-space - the GPU driver doing the necessary hrtimer polling
> to construct a proper count is a much higher quality solution.

That sounds preferable.

I'm open to suggestions for finding another way for userspace to initiate a flush besides through read(), in case there's a concern that it might set a bad precedent. For the i915_oa driver it seems ok at the moment since we don't currently report a useful counter through read(), and for the main use case where we want the flushing we expect that most of the time there won't be any significant cost involved in flushing since we'll be using a very low timer period. Maybe this will bite us later though.

>
> The last thing you want to ask yourself when seeing some weird
> profiling result is 'did user-space properly poll the PMU or did
> we overflow??'. Instrumentation needs to be rock solid dependable
> and fast, in that order.

That sounds like good advice.

Thanks,
- Robert
* Re: [RFC PATCH 0/3] Expose gpu counters via perf pmu driver 2014-11-12 23:33 ` Robert Bragg @ 2014-11-16 9:27 ` Ingo Molnar 0 siblings, 0 replies; 16+ messages in thread From: Ingo Molnar @ 2014-11-16 9:27 UTC (permalink / raw) To: Robert Bragg Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Daniel Vetter, Chris Wilson, Rob Clark, Samuel Pitoiset, Ben Skeggs * Robert Bragg <robert@sixbynine.org> wrote: > > I'd strong[ly] suggest thinking about sampling as well, if > > the hardware exposes sample information: at least for > > profiling CPU loads the difference is like day and night, > > compared to aggregated counts and self-profiling. > > Here I was thinking of counters or data that can be sampled via > mmio using a hrtimer. E.g. the current gpu frequency or the > energy usage. I'm not currently aware of any capability for the > gpu to say trigger an interrupt after a threshold number of > events occurs (like clock cycles) so I think we may generally > be limited to a wall clock time domain for sampling. In general hrtimer-driven polling gives pretty good profiling information as well - key is to be able to get a sample of EU thread execution state. (Trigger thresholds and so can be useful as well, but are a second order concern in terms of profiling quality.) > > It's a very good idea to not expose such limitations to > > user-space - the GPU driver doing the necessary hrtimer > > polling to construct a proper count is a much higher quality > > solution. > > That sounds preferable. > > I'm open to suggestions for finding another way for userspace > to initiate a flush besides through read() in case there's a > concern that might be set a bad precedent. 
> For the i915_oa
> driver it seems ok at the moment since we don't currently
> report a useful counter through read() and for the main use
> case where we want the flushing we expect that most of the time
> there won't be any significant cost involved in flushing since
> we'll be using a very low timer period. Maybe this will bite us
> later though.

You could add an ioctl() as well - we are not religious about them; there are always things that are special enough to not warrant a generic syscall.

Anyway, aggregate counts alone are obviously very useful for analyzing GPU performance, so your initial approach looks perfectly acceptable to me already.

Thanks,

Ingo
end of thread, other threads: [~2014-11-16 9:27 UTC | newest]

Thread overview: 16+ messages

2014-10-22 15:28 [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 1/3] perf: export perf_event_overflow Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 2/3] perf: Add PERF_PMU_CAP_IS_DEVICE flag Robert Bragg
2014-10-22 15:28 ` [RFC PATCH 3/3] i915: Expose PMU for Observation Architecture Robert Bragg
2014-10-23 7:47 ` Chris Wilson
2014-10-24 2:33 ` Robert Bragg
2014-10-24 6:56 ` Chris Wilson
2014-10-23 5:58 ` [RFC PATCH 0/3] Expose gpu counters via perf pmu driver Ingo Molnar
2014-10-24 13:39 ` Robert Bragg
2014-10-30 19:08 ` Peter Zijlstra
2014-11-03 21:47 ` Robert Bragg
2014-11-05 12:33 ` Peter Zijlstra
2014-11-06 0:37 ` Robert Bragg
2014-11-10 11:13 ` Ingo Molnar
2014-11-12 23:33 ` Robert Bragg
2014-11-16 9:27 ` Ingo Molnar