* [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
@ 2017-11-15 12:13 Sagar Arun Kamble
  2017-11-15 12:13 ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Sagar Arun Kamble
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 12:13 UTC (permalink / raw)
  To: intel-gfx

We can compute the system time corresponding to a GPU timestamp by taking a
reference point (CPU monotonic time, GPU timestamp) and then adding the
delta time computed using the timecounter/cyclecounter support in the
kernel. The cyclecounter has to be configured with the GPU timestamp
frequency.
The earlier approach based on cross-timestamps is not needed. It was being
used to approximate the frequency based on invalid assumptions (the drift
seen in the time was possibly due to a precision issue). The time derived
from the GPU clock already has ns precision, and the timecounter handles
the conversion correctly, as verified over varying durations.

This series adds the base timecounter/cyclecounter changes and the changes
needed to include GPU and CPU timestamps in OA samples.
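
Conceptually, for every OA sample the conversion is:

  system_ns = ref_system_ns + ((gpu_ts - ref_gpu_ts) * cc->mult) >> cc->shift

where (ref_system_ns, ref_gpu_ts) is the reference point captured when the
stream is enabled and mult/shift encode the GPU timestamp frequency; this is
what timecounter_cyc2time() computes for us (modulo counter masking and
fractional-ns accounting).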

Sagar Arun Kamble (1):
  drm/i915/perf: Add support to correlate GPU timestamp with system time

Sourab Gupta (3):
  drm/i915/perf: Add support for collecting 64 bit timestamps with OA
    reports
  drm/i915/perf: Extract raw GPU timestamps from OA reports
  drm/i915/perf: Send system clock monotonic time in perf samples

 drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
 drivers/gpu/drm/i915/i915_perf.c | 124 ++++++++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h  |   6 ++
 include/uapi/drm/i915_drm.h      |  14 +++++
 4 files changed, 154 insertions(+), 1 deletion(-)

-- 
1.9.1


* [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
@ 2017-11-15 12:13 ` Sagar Arun Kamble
  2017-11-15 12:25   ` Chris Wilson
  2017-11-15 12:13 ` [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports Sagar Arun Kamble
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 12:13 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta, Matthew Auld

We can compute the system time corresponding to a GPU timestamp by taking
a reference point and then adding the delta time computed using the
timecounter and cyclecounter support in the kernel. The cyclecounter has
to be configured with the GPU timestamp frequency.
In further patches we can leverage the timecounter_cyc2time() function to
compute the system time corresponding to GPU timestamp cycles derived from
an OA report. An important thing to note is that timecounter_cyc2time()
treats the time as going backwards if the delta timestamp is more than
half the maximum ns time covered by the counter (~35min for a 36 bit
counter). If a longer sampling duration is needed, tc->nsec will have to
be refreshed by explicitly reading the timecounter at intervals shorter
than that during sampling.
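A periodic refresh along the lines of the sketch below would cover that
case (illustrative only; the ts_overflow_work member and the refresh
period are hypothetical and not part of this patch):

	static void i915_perf_ts_overflow_work(struct work_struct *work)
	{
		struct i915_perf_stream *stream =
			container_of(work, typeof(*stream), ts_overflow_work.work);

		/* Refreshes tc->cycle_last/tc->nsec well inside half the wrap period */
		timecounter_read(&stream->tc);
		schedule_delayed_work(&stream->ts_overflow_work,
				      I915_PERF_TS_OVERFLOW_PERIOD);
	}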

On enabling the perf stream we start the timecounter/cyclecounter, and
while collecting OA samples we translate GPU timestamps to system
timestamps.

The earlier approach based on cross-timestamps is not needed. It was being
used to approximate the frequency based on invalid assumptions (the drift
seen in the time was possibly due to a precision issue). The time derived
from the GPU clock already has ns precision, and the timecounter handles
the conversion correctly, as verified over varying durations.
Cross-timestamps might be valuable for getting a precise reference point
w.r.t. device and system time, but that needs support for reading the ART
counter from i915, which does not appear to be available. Hence, we
compensate for the counter read latency ourselves when setting up the
reference point.
With this approach the device and system times track each other very
closely, differing only by 5-10us.

Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sourab Gupta <sourab.gupta@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  9 +++++++
 drivers/gpu/drm/i915/i915_perf.c | 53 ++++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h  |  6 +++++
 3 files changed, 68 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 2158a75..e08bc85 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -43,6 +43,7 @@
 #include <linux/pm_qos.h>
 #include <linux/reservation.h>
 #include <linux/shmem_fs.h>
+#include <linux/timecounter.h>
 
 #include <drm/drmP.h>
 #include <drm/intel-gtt.h>
@@ -2149,6 +2150,14 @@ struct i915_perf_stream {
 	 * @oa_config: The OA configuration used by the stream.
 	 */
 	struct i915_oa_config *oa_config;
+
+	/**
+	 * System time correlation variables.
+	 */
+	struct cyclecounter cc;
+	spinlock_t systime_lock;
+	struct timespec64 start_systime;
+	struct timecounter tc;
 };
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 00be015..72ddc34 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -192,6 +192,7 @@
  */
 
 #include <linux/anon_inodes.h>
+#include <linux/clocksource.h>
 #include <linux/sizes.h>
 #include <linux/uuid.h>
 
@@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
 }
 
 /**
+ * i915_cyclecounter_read - read raw cycle/timestamp counter
+ * @cc: cyclecounter structure
+ */
+static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
+{
+	struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	u64 ts_count;
+
+	intel_runtime_pm_get(dev_priv);
+	ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
+				    GEN7_TIMESTAMP_UDW);
+	intel_runtime_pm_put(dev_priv);
+
+	return ts_count;
+}
+
+static void i915_perf_init_cyclecounter(struct i915_perf_stream *stream)
+{
+	struct drm_i915_private *dev_priv = stream->dev_priv;
+	int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
+	struct cyclecounter *cc = &stream->cc;
+	u32 maxsec;
+
+	cc->read = i915_cyclecounter_read;
+	cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
+	maxsec = cc->mask / cs_ts_freq;
+
+	clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
+			       NSEC_PER_SEC, maxsec);
+}
+
+static void i915_perf_init_timecounter(struct i915_perf_stream *stream)
+{
+#define SYSTIME_START_OFFSET	350000 /* Counter read takes about 350us */
+	unsigned long flags;
+	u64 ns;
+
+	i915_perf_init_cyclecounter(stream);
+	spin_lock_init(&stream->systime_lock);
+
+	getnstimeofday64(&stream->start_systime);
+	ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;
+
+	spin_lock_irqsave(&stream->systime_lock, flags);
+	timecounter_init(&stream->tc, &stream->cc, ns);
+	spin_unlock_irqrestore(&stream->systime_lock, flags);
+}
+
+/**
  * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
  * @stream: A disabled i915 perf stream
  *
@@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct i915_perf_stream *stream)
 	/* Allow stream->ops->enable() to refer to this */
 	stream->enabled = true;
 
+	i915_perf_init_timecounter(stream);
+
 	if (stream->ops->enable)
 		stream->ops->enable(stream);
 }
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index cfdf4f8..e7e6966 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -8882,6 +8882,12 @@ enum skl_power_gate {
 
 /* Gen4+ Timestamp and Pipe Frame time stamp registers */
 #define GEN4_TIMESTAMP		_MMIO(0x2358)
+#define GEN7_TIMESTAMP_UDW	_MMIO(0x235C)
+#define PRE_GEN7_TIMESTAMP_WIDTH	32
+#define GEN7_TIMESTAMP_WIDTH		36
+#define CS_TIMESTAMP_WIDTH(dev_priv) \
+	(INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
+				   GEN7_TIMESTAMP_WIDTH)
 #define ILK_TIMESTAMP_HI	_MMIO(0x70070)
 #define IVB_TIMESTAMP_CTR	_MMIO(0x44070)
 
-- 
1.9.1


* [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
  2017-11-15 12:13 ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Sagar Arun Kamble
@ 2017-11-15 12:13 ` Sagar Arun Kamble
  2017-12-06 16:01   ` Lionel Landwerlin
  2017-11-15 12:13 ` [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from " Sagar Arun Kamble
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 12:13 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta, Matthew Auld

From: Sourab Gupta <sourab.gupta@intel.com>

With this patch, for RCS, timestamps and OA reports can be collected
together and provided to userspace in separate sample fields.
The next patch adds the changes to derive the timestamp from the OA report.
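
For reference, userspace would walk the read() buffer roughly as below
(a sketch only; buf, bytes_read and oa_report_size come from the caller
and are assumed here):

	const uint8_t *p = buf;

	while (p < buf + bytes_read) {
		const struct drm_i915_perf_record_header *hdr = (const void *)p;

		if (hdr->type == DRM_I915_PERF_RECORD_SAMPLE) {
			const uint8_t *oa_report = p + sizeof(*hdr);
			uint64_t gpu_ts;

			/* the u64 gpu timestamp field follows the OA report */
			memcpy(&gpu_ts, oa_report + oa_report_size, sizeof(gpu_ts));
		}

		p += hdr->size;
	}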

v2: Rebase. Limiting the changes to only OA sample read. Updated sample
flag name.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sourab Gupta <sourab.gupta@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_perf.c | 20 +++++++++++++++++++-
 include/uapi/drm/i915_drm.h      |  7 +++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 72ddc34..f7e748c 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -293,6 +293,7 @@
 #define OAREPORT_REASON_CTX_SWITCH     (1<<3)
 #define OAREPORT_REASON_CLK_RATIO      (1<<5)
 
+#define I915_PERF_TS_SAMPLE_SIZE	8
 
 /* For sysctl proc_dointvec_minmax of i915_oa_max_sample_rate
  *
@@ -333,7 +334,8 @@
 	[I915_OA_FORMAT_C4_B8]		    = { 7, 64 },
 };
 
-#define SAMPLE_OA_REPORT      (1<<0)
+#define SAMPLE_OA_REPORT	BIT(0)
+#define SAMPLE_GPU_TS		BIT(1)
 
 /**
  * struct perf_open_properties - for validated properties given to open a stream
@@ -599,6 +601,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 	int report_size = dev_priv->perf.oa.oa_buffer.format_size;
 	struct drm_i915_perf_record_header header;
 	u32 sample_flags = stream->sample_flags;
+	u64 gpu_ts = 0;
 
 	header.type = DRM_I915_PERF_RECORD_SAMPLE;
 	header.pad = 0;
@@ -615,6 +618,13 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 	if (sample_flags & SAMPLE_OA_REPORT) {
 		if (copy_to_user(buf, report, report_size))
 			return -EFAULT;
+		buf += report_size;
+	}
+
+	if (sample_flags & SAMPLE_GPU_TS) {
+		/* Timestamp to be populated from OA report */
+		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
+			return -EFAULT;
 	}
 
 	(*offset) += header.size;
@@ -2100,6 +2110,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 	stream->sample_flags |= SAMPLE_OA_REPORT;
 	stream->sample_size += format_size;
 
+	if (props->sample_flags & SAMPLE_GPU_TS) {
+		stream->sample_flags |= SAMPLE_GPU_TS;
+		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
 	dev_priv->perf.oa.oa_buffer.format_size = format_size;
 	if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
 		return -EINVAL;
@@ -2815,6 +2830,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_OA:
 			props->sample_flags |= SAMPLE_OA_REPORT;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_GPU_TS:
+			props->sample_flags |= SAMPLE_GPU_TS;
+			break;
 		case DRM_I915_PERF_PROP_OA_METRICS_SET:
 			if (value == 0) {
 				DRM_DEBUG("Unknown OA metric set ID\n");
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index b579859..0b9249e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1447,6 +1447,12 @@ enum drm_i915_perf_property_id {
 	DRM_I915_PERF_PROP_SAMPLE_OA,
 
 	/**
+	 * The value of this property set to 1 requests inclusion of GPU
+	 * timestamp in the perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
+
+	/**
 	 * The value specifies which set of OA unit metrics should be
 	 * be configured, defining the contents of any OA unit reports.
 	 */
@@ -1532,6 +1538,7 @@ enum drm_i915_perf_record_type {
 	 *     struct drm_i915_perf_record_header header;
 	 *
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
+	 *     { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
 	 * };
 	 */
 	DRM_I915_PERF_RECORD_SAMPLE = 1,
-- 
1.9.1


* [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from OA reports
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
  2017-11-15 12:13 ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Sagar Arun Kamble
  2017-11-15 12:13 ` [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports Sagar Arun Kamble
@ 2017-11-15 12:13 ` Sagar Arun Kamble
  2017-12-06 19:55   ` Lionel Landwerlin
  2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 12:13 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta, Matthew Auld

From: Sourab Gupta <sourab.gupta@intel.com>

The OA reports contain the least significant 32 bits of the GPU timestamp.
This patch enables retrieval of the timestamp field from OA reports, so it
can be forwarded as a 64 bit raw GPU timestamp in the perf samples.
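
For example, with last_gpu_ts = 0x1_ffff_ff00 and a following report
carrying sample_ts = 0x0000_0100, the unsigned 32 bit subtraction gives
delta = 0x200 and last_gpu_ts advances to 0x2_0000_0100, i.e. the 32 bit
wrap is handled correctly as long as reports are consumed more often than
the ~6min wrap period of the 32 bit field.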

v2: Rebase w.r.t new timecounter support.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sourab Gupta <sourab.gupta@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h  |  2 ++
 drivers/gpu/drm/i915/i915_perf.c | 26 +++++++++++++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e08bc85..5534cd2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2151,6 +2151,8 @@ struct i915_perf_stream {
 	 */
 	struct i915_oa_config *oa_config;
 
+	u64 last_gpu_ts;
+
 	/**
 	 * System time correlation variables.
 	 */
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index f7e748c..3b721d7 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -575,6 +575,26 @@ static int append_oa_status(struct i915_perf_stream *stream,
 }
 
 /**
+ * get_gpu_ts_from_oa_report - Retrieve absolute gpu timestamp from OA report
+ *
+ * Note: We are assuming that we're updating last_gpu_ts frequently enough so
+ * that it's never possible to see multiple overflows before we compare
+ * sample_ts to last_gpu_ts. Since this is significantly large duration
+ * (~6min for 80ns ts base), we can safely assume so.
+ */
+static u64 get_gpu_ts_from_oa_report(struct i915_perf_stream *stream,
+				     const u8 *report)
+{
+	u32 sample_ts = *(u32 *)(report + 4);
+	u32 delta;
+
+	delta = sample_ts - (u32)stream->last_gpu_ts;
+	stream->last_gpu_ts += delta;
+
+	return stream->last_gpu_ts;
+}
+
+/**
  * append_oa_sample - Copies single OA report into userspace read() buffer.
  * @stream: An i915-perf stream opened for OA metrics
  * @buf: destination buffer given by userspace
@@ -622,7 +642,9 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 	}
 
 	if (sample_flags & SAMPLE_GPU_TS) {
-		/* Timestamp to be populated from OA report */
+		/* Timestamp populated from OA report */
+		gpu_ts = get_gpu_ts_from_oa_report(stream, report);
+
 		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
 			return -EFAULT;
 	}
@@ -2421,6 +2443,8 @@ static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
 				    GEN7_TIMESTAMP_UDW);
 	intel_runtime_pm_put(dev_priv);
 
+	stream->last_gpu_ts = ts_count;
+
 	return ts_count;
 }
 
-- 
1.9.1


* [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
                   ` (2 preceding siblings ...)
  2017-11-15 12:13 ` [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from " Sagar Arun Kamble
@ 2017-11-15 12:13 ` Sagar Arun Kamble
  2017-11-15 12:31   ` Chris Wilson
                     ` (2 more replies)
  2017-11-15 12:30 ` ✗ Fi.CI.BAT: warning for GPU/CPU timestamps correlation for relating OA samples with system events Patchwork
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 12:13 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta, Matthew Auld

From: Sourab Gupta <sourab.gupta@intel.com>

Currently we can only forward GPU timestamps in the samples (which are
generated via OA reports). This limits the ability to correlate these
samples with system events.

The ability to report timestamps in other clock domains, such as
CLOCK_MONOTONIC, is therefore needed for the perf samples to be of more
practical use to userspace. It becomes important when we want to
correlate/plot GPU events/samples with other system events on the same
timeline (e.g. vblank events, or timestamps when work was submitted to
the kernel, etc.).

This patch proposes a mechanism to achieve this. The correlation between
GPU time and system time is established using the timestamp clock
associated with the command stream, abstracted as a timecounter/cyclecounter
to retrieve correlated GPU/system time values.
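
From userspace the new sample types would be requested together with the
existing ones, roughly like this (sketch only; drm_fd, metrics_set_id and
oa_exponent are assumed, error handling omitted):

	uint64_t properties[] = {
		DRM_I915_PERF_PROP_SAMPLE_OA, 1,
		DRM_I915_PERF_PROP_SAMPLE_GPU_TS, 1,
		DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS, 1,
		DRM_I915_PERF_PROP_OA_METRICS_SET, metrics_set_id,
		DRM_I915_PERF_PROP_OA_FORMAT, I915_OA_FORMAT_A32u40_A4u32_B8_C8,
		DRM_I915_PERF_PROP_OA_EXPONENT, oa_exponent,
	};
	struct drm_i915_perf_open_param param = {
		.flags = I915_PERF_FLAG_FD_CLOEXEC,
		.num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
		.properties_ptr = (uintptr_t)properties,
	};
	int stream_fd = drmIoctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);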

v2: Added i915_driver_init_late() function to capture the new late init
phase for perf (Chris)

v3: Removed cross-timestamp changes.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sourab Gupta <sourab.gupta@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
 include/uapi/drm/i915_drm.h      |  7 +++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3b721d7..94ee924 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -336,6 +336,7 @@
 
 #define SAMPLE_OA_REPORT	BIT(0)
 #define SAMPLE_GPU_TS		BIT(1)
+#define SAMPLE_SYSTEM_TS	BIT(2)
 
 /**
  * struct perf_open_properties - for validated properties given to open a stream
@@ -622,6 +623,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 	struct drm_i915_perf_record_header header;
 	u32 sample_flags = stream->sample_flags;
 	u64 gpu_ts = 0;
+	u64 system_ts = 0;
 
 	header.type = DRM_I915_PERF_RECORD_SAMPLE;
 	header.pad = 0;
@@ -647,6 +649,23 @@ static int append_oa_sample(struct i915_perf_stream *stream,
 
 		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
 			return -EFAULT;
+		buf += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
+	if (sample_flags & SAMPLE_SYSTEM_TS) {
+		gpu_ts = get_gpu_ts_from_oa_report(stream, report);
+		/*
+		 * XXX: timecounter_cyc2time considers time backwards if delta
+		 * timestamp is more than half the max ns time covered by
+		 * counter. It will be ~35min for 36 bit counter. If this much
+		 * sampling duration is needed we will have to update tc->nsec
+		 * by explicitly reading the timecounter (timecounter_read)
+		 * before this duration.
+		 */
+		system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
+
+		if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
+			return -EFAULT;
 	}
 
 	(*offset) += header.size;
@@ -2137,6 +2156,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
 		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
 	}
 
+	if (props->sample_flags & SAMPLE_SYSTEM_TS) {
+		stream->sample_flags |= SAMPLE_SYSTEM_TS;
+		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
+	}
+
 	dev_priv->perf.oa.oa_buffer.format_size = format_size;
 	if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
 		return -EINVAL;
@@ -2857,6 +2881,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
 		case DRM_I915_PERF_PROP_SAMPLE_GPU_TS:
 			props->sample_flags |= SAMPLE_GPU_TS;
 			break;
+		case DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS:
+			props->sample_flags |= SAMPLE_SYSTEM_TS;
+			break;
 		case DRM_I915_PERF_PROP_OA_METRICS_SET:
 			if (value == 0) {
 				DRM_DEBUG("Unknown OA metric set ID\n");
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 0b9249e..283859c 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1453,6 +1453,12 @@ enum drm_i915_perf_property_id {
 	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
 
 	/**
+	 * This property requests inclusion of CLOCK_MONOTONIC system time in
+	 * the perf sample data.
+	 */
+	DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS,
+
+	/**
 	 * The value specifies which set of OA unit metrics should be
 	 * be configured, defining the contents of any OA unit reports.
 	 */
@@ -1539,6 +1545,7 @@ enum drm_i915_perf_record_type {
 	 *
 	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 	 *     { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
+	 *     { u64 system_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS
 	 * };
 	 */
 	DRM_I915_PERF_RECORD_SAMPLE = 1,
-- 
1.9.1


* Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time
  2017-11-15 12:13 ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Sagar Arun Kamble
@ 2017-11-15 12:25   ` Chris Wilson
  2017-11-15 16:41     ` Sagar Arun Kamble
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Chris Wilson @ 2017-11-15 12:25 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

Quoting Sagar Arun Kamble (2017-11-15 12:13:51)
>  #include <drm/drmP.h>
>  #include <drm/intel-gtt.h>
> @@ -2149,6 +2150,14 @@ struct i915_perf_stream {
>          * @oa_config: The OA configuration used by the stream.
>          */
>         struct i915_oa_config *oa_config;
> +
> +       /**
> +        * System time correlation variables.
> +        */
> +       struct cyclecounter cc;
> +       spinlock_t systime_lock;
> +       struct timespec64 start_systime;
> +       struct timecounter tc;

This pattern is repeated a lot by struct timecounter users. (I'm still
trying to understand why the common case is not catered for by a
convenience timecounter api.)

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index 00be015..72ddc34 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -192,6 +192,7 @@
>   */
>  
>  #include <linux/anon_inodes.h>
> +#include <linux/clocksource.h>
>  #include <linux/sizes.h>
>  #include <linux/uuid.h>
>  
> @@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
>  }
>  
>  /**
> + * i915_cyclecounter_read - read raw cycle/timestamp counter
> + * @cc: cyclecounter structure
> + */
> +static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
> +{
> +       struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
> +       struct drm_i915_private *dev_priv = stream->dev_priv;
> +       u64 ts_count;
> +
> +       intel_runtime_pm_get(dev_priv);
> +       ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
> +                                   GEN7_TIMESTAMP_UDW);
> +       intel_runtime_pm_put(dev_priv);
> +
> +       return ts_count;
> +}
> +
> +static void i915_perf_init_cyclecounter(struct i915_perf_stream *stream)
> +{
> +       struct drm_i915_private *dev_priv = stream->dev_priv;
> +       int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
> +       struct cyclecounter *cc = &stream->cc;
> +       u32 maxsec;
> +
> +       cc->read = i915_cyclecounter_read;
> +       cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
> +       maxsec = cc->mask / cs_ts_freq;
> +
> +       clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
> +                              NSEC_PER_SEC, maxsec);
> +}
> +
> +static void i915_perf_init_timecounter(struct i915_perf_stream *stream)
> +{
> +#define SYSTIME_START_OFFSET   350000 /* Counter read takes about 350us */
> +       unsigned long flags;
> +       u64 ns;
> +
> +       i915_perf_init_cyclecounter(stream);
> +       spin_lock_init(&stream->systime_lock);
> +
> +       getnstimeofday64(&stream->start_systime);
> +       ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;

Use ktime directly. Or else Arnd will be back with a patch to fix it.
(All non-ktime interfaces are effectively deprecated; obsolete for
drivers.)
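e.g. (ktime_get_real_ns() being the direct equivalent here;
ktime_get_ns() if CLOCK_MONOTONIC is what you actually want):

	getnstimeofday64(&stream->start_systime);
	ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;

becomes

	ns = ktime_get_real_ns() + SYSTIME_START_OFFSET;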

> +       spin_lock_irqsave(&stream->systime_lock, flags);
> +       timecounter_init(&stream->tc, &stream->cc, ns);
> +       spin_unlock_irqrestore(&stream->systime_lock, flags);
> +}
> +
> +/**
>   * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
>   * @stream: A disabled i915 perf stream
>   *
> @@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct i915_perf_stream *stream)
>         /* Allow stream->ops->enable() to refer to this */
>         stream->enabled = true;
>  
> +       i915_perf_init_timecounter(stream);
> +
>         if (stream->ops->enable)
>                 stream->ops->enable(stream);
>  }
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index cfdf4f8..e7e6966 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -8882,6 +8882,12 @@ enum skl_power_gate {
>  
>  /* Gen4+ Timestamp and Pipe Frame time stamp registers */
>  #define GEN4_TIMESTAMP         _MMIO(0x2358)
> +#define GEN7_TIMESTAMP_UDW     _MMIO(0x235C)
> +#define PRE_GEN7_TIMESTAMP_WIDTH       32
> +#define GEN7_TIMESTAMP_WIDTH           36
> +#define CS_TIMESTAMP_WIDTH(dev_priv) \
> +       (INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
> +                                  GEN7_TIMESTAMP_WIDTH)

s/PRE_GEN7/GEN4/ would be consistent.
If you really want to add support for earlier, I9XX_.

Ok. I can accept the justification, and we are not the only ones who do
the cyclecounter -> timecounter correction like this.
-Chris

* ✗ Fi.CI.BAT: warning for GPU/CPU timestamps correlation for relating OA samples with system events
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
                   ` (3 preceding siblings ...)
  2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
@ 2017-11-15 12:30 ` Patchwork
  2017-12-05 14:16 ` [RFC 0/4] " Lionel Landwerlin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2017-11-15 12:30 UTC (permalink / raw)
  To: Sagar Arun Kamble; +Cc: intel-gfx

== Series Details ==

Series: GPU/CPU timestamps correlation for relating OA samples with system events
URL   : https://patchwork.freedesktop.org/series/33869/
State : warning

== Summary ==

Series 33869v1 GPU/CPU timestamps correlation for relating OA samples with system events
https://patchwork.freedesktop.org/api/1.0/series/33869/revisions/1/mbox/

Warning: Kernel 32bit buildtest failed

Test chamelium:
        Subgroup dp-crc-fast:
                pass       -> FAIL       (fi-kbl-7500u) fdo#102514
Test kms_pipe_crc_basic:
        Subgroup read-crc-pipe-c:
                skip       -> PASS       (fi-hsw-4770r)
        Subgroup read-crc-pipe-c-frame-sequence:
                pass       -> SKIP       (fi-hsw-4770r)

fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514

fi-bdw-5557u     total:289  pass:268  dwarn:0   dfail:0   fail:0   skip:21  time:446s
fi-bdw-gvtdvm    total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:453s
fi-blb-e6850     total:289  pass:223  dwarn:1   dfail:0   fail:0   skip:65  time:385s
fi-bsw-n3050     total:289  pass:243  dwarn:0   dfail:0   fail:0   skip:46  time:536s
fi-bwr-2160      total:289  pass:183  dwarn:0   dfail:0   fail:0   skip:106 time:277s
fi-bxt-dsi       total:289  pass:259  dwarn:0   dfail:0   fail:0   skip:30  time:508s
fi-bxt-j4205     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:503s
fi-byt-j1900     total:289  pass:254  dwarn:0   dfail:0   fail:0   skip:35  time:503s
fi-byt-n2820     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:489s
fi-elk-e7500     total:289  pass:229  dwarn:0   dfail:0   fail:0   skip:60  time:430s
fi-glk-1         total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:540s
fi-hsw-4770      total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:431s
fi-hsw-4770r     total:289  pass:261  dwarn:0   dfail:0   fail:0   skip:28  time:439s
fi-ilk-650       total:289  pass:228  dwarn:0   dfail:0   fail:0   skip:61  time:429s
fi-ivb-3520m     total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:483s
fi-ivb-3770      total:289  pass:260  dwarn:0   dfail:0   fail:0   skip:29  time:463s
fi-kbl-7500u     total:289  pass:263  dwarn:1   dfail:0   fail:1   skip:24  time:482s
fi-kbl-7560u     total:289  pass:270  dwarn:0   dfail:0   fail:0   skip:19  time:534s
fi-kbl-7567u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:471s
fi-kbl-r         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:531s
fi-pnv-d510      total:289  pass:222  dwarn:1   dfail:0   fail:0   skip:66  time:567s
fi-skl-6260u     total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:456s
fi-skl-6600u     total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:548s
fi-skl-6700hq    total:289  pass:263  dwarn:0   dfail:0   fail:0   skip:26  time:563s
fi-skl-6700k     total:289  pass:265  dwarn:0   dfail:0   fail:0   skip:24  time:518s
fi-skl-6770hq    total:289  pass:269  dwarn:0   dfail:0   fail:0   skip:20  time:500s
fi-skl-gvtdvm    total:289  pass:266  dwarn:0   dfail:0   fail:0   skip:23  time:460s
fi-snb-2520m     total:289  pass:250  dwarn:0   dfail:0   fail:0   skip:39  time:560s
fi-snb-2600      total:289  pass:249  dwarn:0   dfail:0   fail:0   skip:40  time:423s
Blacklisted hosts:
fi-cfl-s         total:289  pass:254  dwarn:3   dfail:0   fail:0   skip:32  time:540s
fi-cnl-y         total:289  pass:262  dwarn:0   dfail:0   fail:0   skip:27  time:549s

2cac5c93bc9427509e5164289e810bdcabfc0351 drm-tip: 2017y-11m-15d-09h-10m-41s UTC integration manifest
c61b6ec45639 drm/i915/perf: Send system clock monotonic time in perf samples
46882968ea89 drm/i915/perf: Extract raw GPU timestamps from OA reports
065be89cf048 drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports
878fc922eb52 drm/i915/perf: Add support to correlate GPU timestamp with system time

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7140/

* Re: [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
@ 2017-11-15 12:31   ` Chris Wilson
  2017-11-15 16:51     ` Sagar Arun Kamble
  2017-11-15 17:54   ` Sagar Arun Kamble
  2017-12-05 14:22   ` Lionel Landwerlin
  2 siblings, 1 reply; 45+ messages in thread
From: Chris Wilson @ 2017-11-15 12:31 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

Quoting Sagar Arun Kamble (2017-11-15 12:13:54)
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> Currently, we have the ability to only forward the GPU timestamps in the
> samples (which are generated via OA reports). This limits the ability to
> correlate these samples with the system events.
> 
> An ability is therefore needed to report timestamps in different clock
> domains, such as CLOCK_MONOTONIC, in the perf samples to be of more
> practical use to the userspace. This ability becomes important
> when we want to correlate/plot GPU events/samples with other system events
> on the same timeline (e.g. vblank events, or timestamps when work was
> submitted to kernel, etc.)
> 
> The patch here proposes a mechanism to achieve this. The correlation
> between gpu time and system time is established using the timestamp clock
> associated with the command stream, abstracted as timecounter/cyclecounter
> to retrieve gpu/system time correlated values.
> 
> v2: Added i915_driver_init_late() function to capture the new late init
> phase for perf (Chris)
> 
> v3: Removed cross-timestamp changes.
> 
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sourab Gupta <sourab.gupta@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
>  include/uapi/drm/i915_drm.h      |  7 +++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index 3b721d7..94ee924 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -336,6 +336,7 @@
>  
>  #define SAMPLE_OA_REPORT       BIT(0)
>  #define SAMPLE_GPU_TS          BIT(1)
> +#define SAMPLE_SYSTEM_TS       BIT(2)
>  
>  /**
>   * struct perf_open_properties - for validated properties given to open a stream
> @@ -622,6 +623,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>         struct drm_i915_perf_record_header header;
>         u32 sample_flags = stream->sample_flags;
>         u64 gpu_ts = 0;
> +       u64 system_ts = 0;
>  
>         header.type = DRM_I915_PERF_RECORD_SAMPLE;
>         header.pad = 0;
> @@ -647,6 +649,23 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>  
>                 if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>                         return -EFAULT;
> +               buf += I915_PERF_TS_SAMPLE_SIZE;
> +       }
> +
> +       if (sample_flags & SAMPLE_SYSTEM_TS) {
> +               gpu_ts = get_gpu_ts_from_oa_report(stream, report);

Scope your variables. Stops us from being confused as to where else
gpu_ts or sys_ts may be reused. For instance I first thought you were
using SAMPLE_GPU_TS to initialise gpu_ts

> +               /*
> +                * XXX: timecounter_cyc2time considers time backwards if delta
> +                * timestamp is more than half the max ns time covered by
> +                * counter. It will be ~35min for 36 bit counter. If this much
> +                * sampling duration is needed we will have to update tc->nsec
> +                * by explicitly reading the timecounter (timecounter_read)
> +                * before this duration.
> +                */
> +               system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
> +
> +               if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
> +                       return -EFAULT;

Advance buf.
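
i.e. something like (sketch):

	if (sample_flags & SAMPLE_SYSTEM_TS) {
		u64 system_ts = timecounter_cyc2time(&stream->tc,
				get_gpu_ts_from_oa_report(stream, report));

		if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
			return -EFAULT;
		buf += I915_PERF_TS_SAMPLE_SIZE;
	}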

* Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time
  2017-11-15 12:25   ` Chris Wilson
@ 2017-11-15 16:41     ` Sagar Arun Kamble
  2017-11-23  7:34     ` Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time] Sagar Arun Kamble
  2017-12-05 13:58     ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Lionel Landwerlin
  2 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 16:41 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 11/15/2017 5:55 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-15 12:13:51)
>>   #include <drm/drmP.h>
>>   #include <drm/intel-gtt.h>
>> @@ -2149,6 +2150,14 @@ struct i915_perf_stream {
>>           * @oa_config: The OA configuration used by the stream.
>>           */
>>          struct i915_oa_config *oa_config;
>> +
>> +       /**
>> +        * System time correlation variables.
>> +        */
>> +       struct cyclecounter cc;
>> +       spinlock_t systime_lock;
>> +       struct timespec64 start_systime;
>> +       struct timecounter tc;
> This pattern is repeated a lot by struct timecounter users. (I'm still
> trying to understand why the common case is not catered for by a
> convenience timecounter api.)
Yes. Looks like a good case for changing the cyclecounter* member in
timecounter to an embedded cyclecounter. Then we can avoid the separate
cyclecounter.
>
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>> index 00be015..72ddc34 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -192,6 +192,7 @@
>>    */
>>   
>>   #include <linux/anon_inodes.h>
>> +#include <linux/clocksource.h>
>>   #include <linux/sizes.h>
>>   #include <linux/uuid.h>
>>   
>> @@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
>>   }
>>   
>>   /**
>> + * i915_cyclecounter_read - read raw cycle/timestamp counter
>> + * @cc: cyclecounter structure
>> + */
>> +static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
>> +{
>> +       struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       u64 ts_count;
>> +
>> +       intel_runtime_pm_get(dev_priv);
>> +       ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
>> +                                   GEN7_TIMESTAMP_UDW);
>> +       intel_runtime_pm_put(dev_priv);
>> +
>> +       return ts_count;
>> +}
>> +
>> +static void i915_perf_init_cyclecounter(struct i915_perf_stream *stream)
>> +{
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
>> +       struct cyclecounter *cc = &stream->cc;
>> +       u32 maxsec;
>> +
>> +       cc->read = i915_cyclecounter_read;
>> +       cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
>> +       maxsec = cc->mask / cs_ts_freq;
>> +
>> +       clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
>> +                              NSEC_PER_SEC, maxsec);
>> +}
>> +
>> +static void i915_perf_init_timecounter(struct i915_perf_stream *stream)
>> +{
>> +#define SYSTIME_START_OFFSET   350000 /* Counter read takes about 350us */
>> +       unsigned long flags;
>> +       u64 ns;
>> +
>> +       i915_perf_init_cyclecounter(stream);
>> +       spin_lock_init(&stream->systime_lock);
>> +
>> +       getnstimeofday64(&stream->start_systime);
>> +       ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;
> Use ktime directly. Or else Arnd will be back with a patch to fix it.
> (All non-ktime interfaces are effectively deprecated; obsolete for
> drivers.)
Ok. Will update.
>
>> +       spin_lock_irqsave(&stream->systime_lock, flags);
>> +       timecounter_init(&stream->tc, &stream->cc, ns);
>> +       spin_unlock_irqrestore(&stream->systime_lock, flags);
>> +}
>> +
>> +/**
>>    * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
>>    * @stream: A disabled i915 perf stream
>>    *
>> @@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct i915_perf_stream *stream)
>>          /* Allow stream->ops->enable() to refer to this */
>>          stream->enabled = true;
>>   
>> +       i915_perf_init_timecounter(stream);
>> +
>>          if (stream->ops->enable)
>>                  stream->ops->enable(stream);
>>   }
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index cfdf4f8..e7e6966 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -8882,6 +8882,12 @@ enum skl_power_gate {
>>   
>>   /* Gen4+ Timestamp and Pipe Frame time stamp registers */
>>   #define GEN4_TIMESTAMP         _MMIO(0x2358)
>> +#define GEN7_TIMESTAMP_UDW     _MMIO(0x235C)
>> +#define PRE_GEN7_TIMESTAMP_WIDTH       32
>> +#define GEN7_TIMESTAMP_WIDTH           36
>> +#define CS_TIMESTAMP_WIDTH(dev_priv) \
>> +       (INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
>> +                                  GEN7_TIMESTAMP_WIDTH)
> s/PRE_GEN7/GEN4/ would be consistent.
> If you really want to add support for earlier, I9XX_.
Yes.
>
> Ok. I can accept the justification, and we are not the only ones who do
> the cyclecounter -> timecounter correction like this.
> -Chris


* Re: [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-11-15 12:31   ` Chris Wilson
@ 2017-11-15 16:51     ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 16:51 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 11/15/2017 6:01 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-15 12:13:54)
>> From: Sourab Gupta <sourab.gupta@intel.com>
>>
>> Currently, we have the ability to only forward the GPU timestamps in the
>> samples (which are generated via OA reports). This limits the ability to
>> correlate these samples with the system events.
>>
>> An ability is therefore needed to report timestamps in different clock
>> domains, such as CLOCK_MONOTONIC, in the perf samples to be of more
>> practical use to the userspace. This ability becomes important
>> when we want to correlate/plot GPU events/samples with other system events
>> on the same timeline (e.g. vblank events, or timestamps when work was
>> submitted to kernel, etc.)
>>
>> The patch here proposes a mechanism to achieve this. The correlation
>> between gpu time and system time is established using the timestamp clock
>> associated with the command stream, abstracted as timecounter/cyclecounter
>> to retrieve gpu/system time correlated values.
>>
>> v2: Added i915_driver_init_late() function to capture the new late init
>> phase for perf (Chris)
>>
>> v3: Removed cross-timestamp changes.
>>
>> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
>> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Sourab Gupta <sourab.gupta@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
>>   include/uapi/drm/i915_drm.h      |  7 +++++++
>>   2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>> index 3b721d7..94ee924 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -336,6 +336,7 @@
>>   
>>   #define SAMPLE_OA_REPORT       BIT(0)
>>   #define SAMPLE_GPU_TS          BIT(1)
>> +#define SAMPLE_SYSTEM_TS       BIT(2)
>>   
>>   /**
>>    * struct perf_open_properties - for validated properties given to open a stream
>> @@ -622,6 +623,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>>          struct drm_i915_perf_record_header header;
>>          u32 sample_flags = stream->sample_flags;
>>          u64 gpu_ts = 0;
>> +       u64 system_ts = 0;
>>   
>>          header.type = DRM_I915_PERF_RECORD_SAMPLE;
>>          header.pad = 0;
>> @@ -647,6 +649,23 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>>   
>>                  if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>>                          return -EFAULT;
>> +               buf += I915_PERF_TS_SAMPLE_SIZE;
>> +       }
>> +
>> +       if (sample_flags & SAMPLE_SYSTEM_TS) {
>> +               gpu_ts = get_gpu_ts_from_oa_report(stream, report);
> Scope your variables. Stops us from being confused as to where else
> gpu_ts or sys_ts may be reused. For instance I first thought you were
> using SAMPLE_GPU_TS to initialise gpu_ts
Yes
>> +               /*
>> +                * XXX: timecounter_cyc2time considers time backwards if delta
>> +                * timestamp is more than half the max ns time covered by
>> +                * counter. It will be ~35min for 36 bit counter. If this much
>> +                * sampling duration is needed we will have to update tc->nsec
>> +                * by explicitly reading the timecounter (timecounter_read)
>> +                * before this duration.
>> +                */
>> +               system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
>> +
>> +               if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
>> +                       return -EFAULT;
> Advance buf.
I had kept the advance logic only where a new field follows, since the
advance is also missing for SAMPLE_OA_REPORT in current drm-tip. Will fix.
Thanks for the review :)


* Re: [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
  2017-11-15 12:31   ` Chris Wilson
@ 2017-11-15 17:54   ` Sagar Arun Kamble
  2017-12-05 14:22   ` Lionel Landwerlin
  2 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-15 17:54 UTC (permalink / raw)
  To: intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 11/15/2017 5:43 PM, Sagar Arun Kamble wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> Currently, we have the ability to only forward the GPU timestamps in the
> samples (which are generated via OA reports). This limits the ability to
> correlate these samples with the system events.
>
> An ability is therefore needed to report timestamps in different clock
> domains, such as CLOCK_MONOTONIC, in the perf samples to be of more
> practical use to the userspace. This ability becomes important
> when we want to correlate/plot GPU events/samples with other system events
> on the same timeline (e.g. vblank events, or timestamps when work was
> submitted to kernel, etc.)
>
> The patch here proposes a mechanism to achieve this. The correlation
> between gpu time and system time is established using the timestamp clock
> associated with the command stream, abstracted as timecounter/cyclecounter
> to retrieve gpu/system time correlated values.
>
> v2: Added i915_driver_init_late() function to capture the new late init
> phase for perf (Chris)
>
> v3: Removed cross-timestamp changes.
>
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sourab Gupta <sourab.gupta@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
>   include/uapi/drm/i915_drm.h      |  7 +++++++
>   2 files changed, 34 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index 3b721d7..94ee924 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -336,6 +336,7 @@
>   
>   #define SAMPLE_OA_REPORT	BIT(0)
>   #define SAMPLE_GPU_TS		BIT(1)
> +#define SAMPLE_SYSTEM_TS	BIT(2)
>   
>   /**
>    * struct perf_open_properties - for validated properties given to open a stream
> @@ -622,6 +623,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>   	struct drm_i915_perf_record_header header;
>   	u32 sample_flags = stream->sample_flags;
>   	u64 gpu_ts = 0;
> +	u64 system_ts = 0;
>   
>   	header.type = DRM_I915_PERF_RECORD_SAMPLE;
>   	header.pad = 0;
> @@ -647,6 +649,23 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>   
>   		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>   			return -EFAULT;
> +		buf += I915_PERF_TS_SAMPLE_SIZE;
> +	}
> +
> +	if (sample_flags & SAMPLE_SYSTEM_TS) {
> +		gpu_ts = get_gpu_ts_from_oa_report(stream, report);
> +		/*
> +		 * XXX: timecounter_cyc2time considers time backwards if delta
> +		 * timestamp is more than half the max ns time covered by
> +		 * counter. It will be ~35min for 36 bit counter. If this much
> +		 * sampling duration is needed we will have to update tc->nsec
> +		 * by explicitly reading the timecounter (timecounter_read)
> +		 * before this duration.
This is implemented as an overflow watchdog in mlx5e_timestamp_init()
(drivers/net/ethernet/mellanox). We will need a similar mechanism here.
> +		 */
> +		system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
> +
> +		if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
> +			return -EFAULT;
>   	}
>   
>   	(*offset) += header.size;
> @@ -2137,6 +2156,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
>   		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
>   	}
>   
> +	if (props->sample_flags & SAMPLE_SYSTEM_TS) {
> +		stream->sample_flags |= SAMPLE_SYSTEM_TS;
> +		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
> +	}
> +
>   	dev_priv->perf.oa.oa_buffer.format_size = format_size;
>   	if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
>   		return -EINVAL;
> @@ -2857,6 +2881,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
>   		case DRM_I915_PERF_PROP_SAMPLE_GPU_TS:
>   			props->sample_flags |= SAMPLE_GPU_TS;
>   			break;
> +		case DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS:
> +			props->sample_flags |= SAMPLE_SYSTEM_TS;
> +			break;
>   		case DRM_I915_PERF_PROP_OA_METRICS_SET:
>   			if (value == 0) {
>   				DRM_DEBUG("Unknown OA metric set ID\n");
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 0b9249e..283859c 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,12 @@ enum drm_i915_perf_property_id {
>   	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
>   
>   	/**
> +	 * This property requests inclusion of CLOCK_MONOTONIC system time in
> +	 * the perf sample data.
> +	 */
> +	DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS,
> +
> +	/**
>   	 * The value specifies which set of OA unit metrics should be
>   	 * be configured, defining the contents of any OA unit reports.
>   	 */
> @@ -1539,6 +1545,7 @@ enum drm_i915_perf_record_type {
>   	 *
>   	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
>   	 *     { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
> +	 *     { u64 system_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS
>   	 * };
>   	 */
>   	DRM_I915_PERF_RECORD_SAMPLE = 1,


* Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
  2017-11-15 12:25   ` Chris Wilson
  2017-11-15 16:41     ` Sagar Arun Kamble
@ 2017-11-23  7:34     ` Sagar Arun Kamble
  2017-11-23 18:59       ` Thomas Gleixner
  2017-12-05 13:58     ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Lionel Landwerlin
  2 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-23  7:34 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Stephen Boyd
  Cc: Chris Wilson, linux-kernel, Sagar A Kamble

Hi,

We need your inputs on a possible optimization of the timecounter/cyclecounter structures and their usage.
This mail is in response to the review of the patch https://patchwork.freedesktop.org/patch/188448/.

As per Chris's observation below, about a dozen timecounter users in the kernel define the following
structures individually:

spinlock_t lock;
struct cyclecounter cc;
struct timecounter tc;

Can we move the lock and cc into tc? That would be more convenient.
It would also allow unifying the locking/overflow watchdog handling across all drivers.
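Something along these lines, for illustration (a sketch only, not a
settled interface):

	struct timecounter {
		struct cyclecounter cc;		/* was: const struct cyclecounter *cc */
		spinlock_t lock;		/* serialize init/read/cyc2time */
		u64 cycle_last;
		u64 nsec;
		u64 mask;
		u64 frac;
	};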

Please suggest.

Thanks
Sagar

On 11/15/2017 5:55 PM, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-15 12:13:51)
>>   #include <drm/drmP.h>
>>   #include <drm/intel-gtt.h>
>> @@ -2149,6 +2150,14 @@ struct i915_perf_stream {
>>           * @oa_config: The OA configuration used by the stream.
>>           */
>>          struct i915_oa_config *oa_config;
>> +
>> +       /**
>> +        * System time correlation variables.
>> +        */
>> +       struct cyclecounter cc;
>> +       spinlock_t systime_lock;
>> +       struct timespec64 start_systime;
>> +       struct timecounter tc;
> This pattern is repeated a lot by struct timecounter users. (I'm still
> trying to understand why the common case is not catered for by a
> convenience timecounter api.)
>
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>> index 00be015..72ddc34 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -192,6 +192,7 @@
>>    */
>>   
>>   #include <linux/anon_inodes.h>
>> +#include <linux/clocksource.h>
>>   #include <linux/sizes.h>
>>   #include <linux/uuid.h>
>>   
>> @@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
>>   }
>>   
>>   /**
>> + * i915_cyclecounter_read - read raw cycle/timestamp counter
>> + * @cc: cyclecounter structure
>> + */
>> +static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
>> +{
>> +       struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       u64 ts_count;
>> +
>> +       intel_runtime_pm_get(dev_priv);
>> +       ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
>> +                                   GEN7_TIMESTAMP_UDW);
>> +       intel_runtime_pm_put(dev_priv);
>> +
>> +       return ts_count;
>> +}
>> +
>> +static void i915_perf_init_cyclecounter(struct i915_perf_stream *stream)
>> +{
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
>> +       struct cyclecounter *cc = &stream->cc;
>> +       u32 maxsec;
>> +
>> +       cc->read = i915_cyclecounter_read;
>> +       cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
>> +       maxsec = cc->mask / cs_ts_freq;
>> +
>> +       clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
>> +                              NSEC_PER_SEC, maxsec);
>> +}
>> +
>> +static void i915_perf_init_timecounter(struct i915_perf_stream *stream)
>> +{
>> +#define SYSTIME_START_OFFSET   350000 /* Counter read takes about 350us */
>> +       unsigned long flags;
>> +       u64 ns;
>> +
>> +       i915_perf_init_cyclecounter(stream);
>> +       spin_lock_init(&stream->systime_lock);
>> +
>> +       getnstimeofday64(&stream->start_systime);
>> +       ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;
> Use ktime directly. Or else Arnd will be back with a patch to fix it.
> (All non-ktime interfaces are effectively deprecated; obsolete for
> drivers.)
>
>> +       spin_lock_irqsave(&stream->systime_lock, flags);
>> +       timecounter_init(&stream->tc, &stream->cc, ns);
>> +       spin_unlock_irqrestore(&stream->systime_lock, flags);
>> +}
>> +
>> +/**
>>    * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
>>    * @stream: A disabled i915 perf stream
>>    *
>> @@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct i915_perf_stream *stream)
>>          /* Allow stream->ops->enable() to refer to this */
>>          stream->enabled = true;
>>   
>> +       i915_perf_init_timecounter(stream);
>> +
>>          if (stream->ops->enable)
>>                  stream->ops->enable(stream);
>>   }
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index cfdf4f8..e7e6966 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -8882,6 +8882,12 @@ enum skl_power_gate {
>>   
>>   /* Gen4+ Timestamp and Pipe Frame time stamp registers */
>>   #define GEN4_TIMESTAMP         _MMIO(0x2358)
>> +#define GEN7_TIMESTAMP_UDW     _MMIO(0x235C)
>> +#define PRE_GEN7_TIMESTAMP_WIDTH       32
>> +#define GEN7_TIMESTAMP_WIDTH           36
>> +#define CS_TIMESTAMP_WIDTH(dev_priv) \
>> +       (INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
>> +                                  GEN7_TIMESTAMP_WIDTH)
> s/PRE_GEN7/GEN4/ would be consistent.
> If you really want to add support for earlier, I9XX_.
>
> Ok. I can accept the justification, and we are not the only ones who do
> the cyclecounter -> timecounter correction like this.
> -Chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
  2017-11-23  7:34     ` Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time] Sagar Arun Kamble
@ 2017-11-23 18:59       ` Thomas Gleixner
  2017-11-24  9:06         ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2017-11-23 18:59 UTC (permalink / raw)
  To: Sagar Arun Kamble; +Cc: John Stultz, Stephen Boyd, Chris Wilson, linux-kernel

On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
> We needed inputs on possible optimization that can be done to
> timecounter/cyclecounter structures/usage.
> This mail is in response to review of patch
> https://patchwork.freedesktop.org/patch/188448/.
> 
> As Chris's observation below, about dozen of timecounter users in the kernel
> have below structures
> defined individually:
> 
> spinlock_t lock;
> struct cyclecounter cc;
> struct timecounter tc;
> 
> Can we move lock and cc to tc? That way it will be convenient.
> Also it will allow unifying the locking/overflow watchdog handling across all
> drivers.

Looks like none of the timecounter usage sites has a real need to separate
timecounter and cyclecounter.

The lock is a different question. The locking of the various drivers
differs and I have no idea how you want to handle that. Just sticking the
lock into the datastructure and then not making use of it in the
timecounter code, leaving it to the callsites, does not make sense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
  2017-11-23 18:59       ` Thomas Gleixner
@ 2017-11-24  9:06         ` Sagar Arun Kamble
  2017-11-24 13:31           ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-24  9:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: John Stultz, Stephen Boyd, Chris Wilson, linux-kernel



On 11/24/2017 12:29 AM, Thomas Gleixner wrote:
> On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
>> We needed inputs on possible optimization that can be done to
>> timecounter/cyclecounter structures/usage.
>> This mail is in response to review of patch
>> https://patchwork.freedesktop.org/patch/188448/.
>>
>> As Chris's observation below, about dozen of timecounter users in the kernel
>> have below structures
>> defined individually:
>>
>> spinlock_t lock;
>> struct cyclecounter cc;
>> struct timecounter tc;
>>
>> Can we move lock and cc to tc? That way it will be convenient.
>> Also it will allow unifying the locking/overflow watchdog handling across all
>> drivers.
> Looks like none of the timecounter usage sites has a real need to separate
> timecounter and cyclecounter.

Yes. Will share patch for this change.

> The lock is a different question. The locking of the various drivers
> differs and I have no idea how you want to handle that. Just sticking the
> lock into the datastructure and then not making use of it in the
> timercounter code and leave it to the callsites does not make sense.

Most of the locks are held around timecounter_read. In some instances the lock is held when the cyclecounter is
updated standalone or along with the timecounter calls.
I was thinking that if we move the lock into the timecounter functions, drivers would only have to do locking around
their own operations on the cyclecounter. But another problem I see is that there are variations of the locking calls,
such as lock_irqsave, lock_bh and write_lock_irqsave (some using rwlock_t). Should all of this locking be left
to the driver only, then?

> Thanks,
>
> 	tglx

Thanks
Sagar

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
  2017-11-24  9:06         ` Sagar Arun Kamble
@ 2017-11-24 13:31           ` Thomas Gleixner
  2017-11-27 10:05               ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:31 UTC (permalink / raw)
  To: Sagar Arun Kamble; +Cc: John Stultz, Stephen Boyd, Chris Wilson, linux-kernel

On Fri, 24 Nov 2017, Sagar Arun Kamble wrote:
> On 11/24/2017 12:29 AM, Thomas Gleixner wrote:
> > On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
> > > We needed inputs on possible optimization that can be done to
> > > timecounter/cyclecounter structures/usage.
> > > This mail is in response to review of patch
> > > https://patchwork.freedesktop.org/patch/188448/.
> > > 
> > > As Chris's observation below, about dozen of timecounter users in the
> > > kernel
> > > have below structures
> > > defined individually:
> > > 
> > > spinlock_t lock;
> > > struct cyclecounter cc;
> > > struct timecounter tc;
> > > 
> > > Can we move lock and cc to tc? That way it will be convenient.
> > > Also it will allow unifying the locking/overflow watchdog handling across
> > > all
> > > drivers.
> > Looks like none of the timecounter usage sites has a real need to separate
> > timecounter and cyclecounter.
> 
> Yes. Will share patch for this change.
> 
> > The lock is a different question. The locking of the various drivers
> > differs and I have no idea how you want to handle that. Just sticking the
> > lock into the datastructure and then not making use of it in the
> > timercounter code and leave it to the callsites does not make sense.
> 
> Most of the locks are held around timecounter_read. In some instances it
> is held when cyclecounter is updated standalone or is updated along with
> timecounter calls.  Was thinking if we move the lock in timecounter
> functions, drivers just have to do locking around its operations on
> cyclecounter. But then another problem I see is there are variation of
> locking calls like lock_irqsave, lock_bh, write_lock_irqsave (some using
> rwlock_t). Should this all locking be left to driver only then?

You could have the lock in the struct and protect the inner workings in the
related core functions.

That might remove the locking requirements from some of the callers, while the
others still keep their own locking around it.
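
As a sketch only (assuming the lock ends up embedded in struct timecounter as
discussed above, which is not existing code), the core function could do
something like:

static u64 timecounter_read_locked(struct timecounter *tc)
{
	unsigned long flags;
	u64 nsec;

	/* serialize against concurrent updates of tc/cc inside the core */
	spin_lock_irqsave(&tc->lock, flags);
	nsec = timecounter_read(tc);
	spin_unlock_irqrestore(&tc->lock, flags);

	return nsec;
}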

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
@ 2017-11-27 10:05               ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-11-27 10:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Stephen Boyd, Chris Wilson, linux-kernel, netdev,
	linux-rdma



On 11/24/2017 7:01 PM, Thomas Gleixner wrote:
> On Fri, 24 Nov 2017, Sagar Arun Kamble wrote:
>> On 11/24/2017 12:29 AM, Thomas Gleixner wrote:
>>> On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
>>>> We needed inputs on possible optimization that can be done to
>>>> timecounter/cyclecounter structures/usage.
>>>> This mail is in response to review of patch
>>>> https://patchwork.freedesktop.org/patch/188448/.
>>>>
>>>> As Chris's observation below, about dozen of timecounter users in the
>>>> kernel
>>>> have below structures
>>>> defined individually:
>>>>
>>>> spinlock_t lock;
>>>> struct cyclecounter cc;
>>>> struct timecounter tc;
>>>>
>>>> Can we move lock and cc to tc? That way it will be convenient.
>>>> Also it will allow unifying the locking/overflow watchdog handling across
>>>> all
>>>> drivers.
>>> Looks like none of the timecounter usage sites has a real need to separate
>>> timecounter and cyclecounter.
>> Yes. Will share patch for this change.
>>
>>> The lock is a different question. The locking of the various drivers
>>> differs and I have no idea how you want to handle that. Just sticking the
>>> lock into the datastructure and then not making use of it in the
>>> timercounter code and leave it to the callsites does not make sense.
>> Most of the locks are held around timecounter_read. In some instances it
>> is held when cyclecounter is updated standalone or is updated along with
>> timecounter calls.  Was thinking if we move the lock in timecounter
>> functions, drivers just have to do locking around its operations on
>> cyclecounter. But then another problem I see is there are variation of
>> locking calls like lock_irqsave, lock_bh, write_lock_irqsave (some using
>> rwlock_t). Should this all locking be left to driver only then?
> You could have the lock in the struct and protect the inner workings in the
> related core functions.
>
> That might remove locking requirements from some of the callers and the
> others still have their own thing around it.

For drivers having a static/fixed cyclecounter, we can rely only on the lock inside the timecounter.
Most of the network drivers update the cyclecounter at runtime, and they would have to rely on two locks if
we add one to the timecounter. That may not be efficient for them. Also, the lock in the timecounter would have to be less restrictive (maybe a seqlock), I guess.
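
Purely as an illustration of that thought, a seqlock-based scheme might look roughly
like this (the sample_clock struct and its seq member are hypothetical, not existing
code):

struct sample_clock {
	seqlock_t		seq;
	struct cyclecounter	cc;
	struct timecounter	tc;
};

static void sample_clock_refresh(struct sample_clock *clock)
{
	unsigned long flags;

	/* writer: fold elapsed cycles into tc before the counter wraps */
	write_seqlock_irqsave(&clock->seq, flags);
	timecounter_read(&clock->tc);
	write_sequnlock_irqrestore(&clock->seq, flags);
}

static u64 sample_clock_cyc2ns(struct sample_clock *clock, u64 cycles)
{
	unsigned int seq;
	u64 ns;

	/* readers run concurrently and simply retry if a writer slipped in */
	do {
		seq = read_seqbegin(&clock->seq);
		ns = timecounter_cyc2time(&clock->tc, cycles);
	} while (read_seqretry(&clock->seq, seq));

	return ns;
}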

Cc'd Mellanox list for inputs on this.

I have started to feel that the current approach, with drivers managing the locks, is the right one, so it is
better to leave the lock out of the timecounter.

> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
@ 2017-11-30 21:03                   ` Saeed Mahameed
  0 siblings, 0 replies; 45+ messages in thread
From: Saeed Mahameed @ 2017-11-30 21:03 UTC (permalink / raw)
  To: Sagar Arun Kamble
  Cc: Thomas Gleixner, John Stultz, Stephen Boyd, Chris Wilson,
	linux-kernel, Linux Netdev List, linux-rdma

On Mon, Nov 27, 2017 at 2:05 AM, Sagar Arun Kamble
<sagar.a.kamble@intel.com> wrote:
>
>
> On 11/24/2017 7:01 PM, Thomas Gleixner wrote:
>>
>> On Fri, 24 Nov 2017, Sagar Arun Kamble wrote:
>>>
>>> On 11/24/2017 12:29 AM, Thomas Gleixner wrote:
>>>>
>>>> On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
>>>>>
>>>>> We needed inputs on possible optimization that can be done to
>>>>> timecounter/cyclecounter structures/usage.
>>>>> This mail is in response to review of patch
>>>>> https://patchwork.freedesktop.org/patch/188448/.
>>>>>
>>>>> As Chris's observation below, about dozen of timecounter users in the
>>>>> kernel
>>>>> have below structures
>>>>> defined individually:
>>>>>
>>>>> spinlock_t lock;
>>>>> struct cyclecounter cc;
>>>>> struct timecounter tc;
>>>>>
>>>>> Can we move lock and cc to tc? That way it will be convenient.
>>>>> Also it will allow unifying the locking/overflow watchdog handling
>>>>> across
>>>>> all
>>>>> drivers.
>>>>
>>>> Looks like none of the timecounter usage sites has a real need to
>>>> separate
>>>> timecounter and cyclecounter.
>>>
>>> Yes. Will share patch for this change.
>>>
>>>> The lock is a different question. The locking of the various drivers
>>>> differs and I have no idea how you want to handle that. Just sticking
>>>> the
>>>> lock into the datastructure and then not making use of it in the
>>>> timercounter code and leave it to the callsites does not make sense.
>>>
>>> Most of the locks are held around timecounter_read. In some instances it
>>> is held when cyclecounter is updated standalone or is updated along with
>>> timecounter calls.  Was thinking if we move the lock in timecounter
>>> functions, drivers just have to do locking around its operations on
>>> cyclecounter. But then another problem I see is there are variation of
>>> locking calls like lock_irqsave, lock_bh, write_lock_irqsave (some using
>>> rwlock_t). Should this all locking be left to driver only then?
>>
>> You could have the lock in the struct and protect the inner workings in
>> the
>> related core functions.
>>
>> That might remove locking requirements from some of the callers and the
>> others still have their own thing around it.
>
>
> For drivers having static/fixed cyclecounter, we can rely only on lock
> inside timecounter.
> Most of the network drivers update cyclecounter at runtime and they will
> have to rely on two locks if
> we add one to timecounter. This may not be efficient for them. Also the lock
> in timecounter has to be less restrictive (may be seqlock) I guess.
>
> Cc'd Mellanox list for inputs on this.
>
> I have started feeling that the current approach of drivers managing the
> locks is the right one so better leave the
> lock out of timecounter.
>

I agree here,

In mlx5 we rely on our own read/write lock to serialize access to the
mlx5_clock struct (the mlx5 timecounter and cyclecounter).
The access is not as simple as:

    lock()
    call the timecounter API
    unlock()

Sometimes we also explicitly update/adjust the cycle counter with
mlx5-specific calculations after we read the timecounter, all inside
our lock, e.g. in mlx5_ptp_adjfreq():

    write_lock_irqsave(&clock->lock, flags);
    timecounter_read(&clock->tc);
    clock->cycles.mult = neg_adj ? clock->nominal_c_mult - diff :
                       clock->nominal_c_mult + diff;
    write_unlock_irqrestore(&clock->lock, flags);

So I don't think it will be a simple task to have a generic thread-safe
timecounter API without the need to specifically adjust it for all driver
use cases.
Also, as said above, at runtime it is not obvious in which context the
timecounter will be accessed (irq / soft-irq / user).

Let's keep it as is, and let the driver decide which locking scheme is
most suitable for it.

Thanks,
Saeed.

>> Thanks,
>>
>>         tglx
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time]
@ 2017-12-01  7:42                       ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-01  7:42 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Thomas Gleixner, John Stultz, Stephen Boyd, Chris Wilson,
	linux-kernel, Linux Netdev List, linux-rdma



On 12/1/2017 2:33 AM, Saeed Mahameed wrote:
> On Mon, Nov 27, 2017 at 2:05 AM, Sagar Arun Kamble
> <sagar.a.kamble@intel.com> wrote:
>>
>> On 11/24/2017 7:01 PM, Thomas Gleixner wrote:
>>> On Fri, 24 Nov 2017, Sagar Arun Kamble wrote:
>>>> On 11/24/2017 12:29 AM, Thomas Gleixner wrote:
>>>>> On Thu, 23 Nov 2017, Sagar Arun Kamble wrote:
>>>>>> We needed inputs on possible optimization that can be done to
>>>>>> timecounter/cyclecounter structures/usage.
>>>>>> This mail is in response to review of patch
>>>>>> https://patchwork.freedesktop.org/patch/188448/.
>>>>>>
>>>>>> As Chris's observation below, about dozen of timecounter users in the
>>>>>> kernel
>>>>>> have below structures
>>>>>> defined individually:
>>>>>>
>>>>>> spinlock_t lock;
>>>>>> struct cyclecounter cc;
>>>>>> struct timecounter tc;
>>>>>>
>>>>>> Can we move lock and cc to tc? That way it will be convenient.
>>>>>> Also it will allow unifying the locking/overflow watchdog handling
>>>>>> across
>>>>>> all
>>>>>> drivers.
>>>>> Looks like none of the timecounter usage sites has a real need to
>>>>> separate
>>>>> timecounter and cyclecounter.
>>>> Yes. Will share patch for this change.
>>>>
>>>>> The lock is a different question. The locking of the various drivers
>>>>> differs and I have no idea how you want to handle that. Just sticking
>>>>> the
>>>>> lock into the datastructure and then not making use of it in the
>>>>> timercounter code and leave it to the callsites does not make sense.
>>>> Most of the locks are held around timecounter_read. In some instances it
>>>> is held when cyclecounter is updated standalone or is updated along with
>>>> timecounter calls.  Was thinking if we move the lock in timecounter
>>>> functions, drivers just have to do locking around its operations on
>>>> cyclecounter. But then another problem I see is there are variation of
>>>> locking calls like lock_irqsave, lock_bh, write_lock_irqsave (some using
>>>> rwlock_t). Should this all locking be left to driver only then?
>>> You could have the lock in the struct and protect the inner workings in
>>> the
>>> related core functions.
>>>
>>> That might remove locking requirements from some of the callers and the
>>> others still have their own thing around it.
>>
>> For drivers having static/fixed cyclecounter, we can rely only on lock
>> inside timecounter.
>> Most of the network drivers update cyclecounter at runtime and they will
>> have to rely on two locks if
>> we add one to timecounter. This may not be efficient for them. Also the lock
>> in timecounter has to be less restrictive (may be seqlock) I guess.
>>
>> Cc'd Mellanox list for inputs on this.
>>
>> I have started feeling that the current approach of drivers managing the
>> locks is the right one so better leave the
>> lock out of timecounter.
>>
> I agree here,
>
> In mlx5 we rely on our own read/write lock to serialize access to
> mlx5_clock struct (mlx5 timecounter and cyclecounter).
> the access is not as simple as
> lock()
> call time_counter_API
> unlock()
>
> Sometimes we also explicitly update/adjust timecycles counters with
> mlx5 specific calculations after we read the timecounter all inside
> our lock.
> e.g.
> @mlx5_ptp_adjfreq()
>
>      write_lock_irqsave(&clock->lock, flags);
>      timecounter_read(&clock->tc);
>      clock->cycles.mult = neg_adj ? clock->nominal_c_mult - diff :
>                         clock->nominal_c_mult + diff;
>      write_unlock_irqrestore(&clock->lock, flags);
>
> So i don't think it will be a simple task to have a generic thread
> safe timecounter API, without the need to specifically adjust it for
> all driver use-cases.
> Also as said above, in runtime it is not obvious in which context the
> timecounter will be accessed irq/soft irq/user.
>
> let's keep it as is, and let the driver decide which locking scheme is
> most suitable for it.

Yes. Thanks for your inputs, Saeed.

Regards
Sagar

>
> Thanks,
> Saeed.
>
>>> Thanks,
>>>
>>>          tglx
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time
  2017-11-15 12:25   ` Chris Wilson
  2017-11-15 16:41     ` Sagar Arun Kamble
  2017-11-23  7:34     ` Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time] Sagar Arun Kamble
@ 2017-12-05 13:58     ` Lionel Landwerlin
  2017-12-06  8:17       ` Sagar Arun Kamble
  2 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-05 13:58 UTC (permalink / raw)
  To: Chris Wilson, Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

On 15/11/17 12:25, Chris Wilson wrote:
> Quoting Sagar Arun Kamble (2017-11-15 12:13:51)
>>   #include <drm/drmP.h>
>>   #include <drm/intel-gtt.h>
>> @@ -2149,6 +2150,14 @@ struct i915_perf_stream {
>>           * @oa_config: The OA configuration used by the stream.
>>           */
>>          struct i915_oa_config *oa_config;
>> +
>> +       /**
>> +        * System time correlation variables.
>> +        */
>> +       struct cyclecounter cc;
>> +       spinlock_t systime_lock;
>> +       struct timespec64 start_systime;
>> +       struct timecounter tc;
> This pattern is repeated a lot by struct timecounter users. (I'm still
> trying to understand why the common case is not catered for by a
> convenience timecounter api.)
>
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>> index 00be015..72ddc34 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -192,6 +192,7 @@
>>    */
>>   
>>   #include <linux/anon_inodes.h>
>> +#include <linux/clocksource.h>
>>   #include <linux/sizes.h>
>>   #include <linux/uuid.h>
>>   
>> @@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct file *file, poll_table *wait)
>>   }
>>   
>>   /**
>> + * i915_cyclecounter_read - read raw cycle/timestamp counter
>> + * @cc: cyclecounter structure
>> + */
>> +static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
>> +{
>> +       struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       u64 ts_count;
>> +
>> +       intel_runtime_pm_get(dev_priv);
>> +       ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
>> +                                   GEN7_TIMESTAMP_UDW);
>> +       intel_runtime_pm_put(dev_priv);
>> +
>> +       return ts_count;
>> +}
>> +
>> +static void i915_perf_init_cyclecounter(struct i915_perf_stream *stream)
>> +{
>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>> +       int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
>> +       struct cyclecounter *cc = &stream->cc;
>> +       u32 maxsec;
>> +
>> +       cc->read = i915_cyclecounter_read;
>> +       cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
>> +       maxsec = cc->mask / cs_ts_freq;
>> +
>> +       clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
>> +                              NSEC_PER_SEC, maxsec);
>> +}
>> +
>> +static void i915_perf_init_timecounter(struct i915_perf_stream *stream)
>> +{
>> +#define SYSTIME_START_OFFSET   350000 /* Counter read takes about 350us */
>> +       unsigned long flags;
>> +       u64 ns;
>> +
>> +       i915_perf_init_cyclecounter(stream);
>> +       spin_lock_init(&stream->systime_lock);
>> +
>> +       getnstimeofday64(&stream->start_systime);
>> +       ns = timespec64_to_ns(&stream->start_systime) + SYSTIME_START_OFFSET;
> Use ktime directly. Or else Arnd will be back with a patch to fix it.
> (All non-ktime interfaces are effectively deprecated; obsolete for
> drivers.)
>
>> +       spin_lock_irqsave(&stream->systime_lock, flags);
>> +       timecounter_init(&stream->tc, &stream->cc, ns);
>> +       spin_unlock_irqrestore(&stream->systime_lock, flags);
>> +}
>> +
>> +/**
>>    * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
>>    * @stream: A disabled i915 perf stream
>>    *
>> @@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct i915_perf_stream *stream)
>>          /* Allow stream->ops->enable() to refer to this */
>>          stream->enabled = true;
>>   
>> +       i915_perf_init_timecounter(stream);
>> +
>>          if (stream->ops->enable)
>>                  stream->ops->enable(stream);
>>   }
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index cfdf4f8..e7e6966 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -8882,6 +8882,12 @@ enum skl_power_gate {
>>   
>>   /* Gen4+ Timestamp and Pipe Frame time stamp registers */
>>   #define GEN4_TIMESTAMP         _MMIO(0x2358)
>> +#define GEN7_TIMESTAMP_UDW     _MMIO(0x235C)
>> +#define PRE_GEN7_TIMESTAMP_WIDTH       32
>> +#define GEN7_TIMESTAMP_WIDTH           36
>> +#define CS_TIMESTAMP_WIDTH(dev_priv) \
>> +       (INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
>> +                                  GEN7_TIMESTAMP_WIDTH)
> s/PRE_GEN7/GEN4/ would be consistent.
> If you really want to add support for earlier, I9XX_.

Why not use the RING_TIMESTAMP() / RING_TIMESTAMP_UDW() macros?
They're already used for the reg_read ioctl.
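
For illustration, the read above could then become something along these lines
(sketch only, not tested):

	/* same register, expressed through the existing per-engine macros */
	ts_count = I915_READ64_2x32(RING_TIMESTAMP(RENDER_RING_BASE),
				    RING_TIMESTAMP_UDW(RENDER_RING_BASE));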

>
> Ok. I can accept the justification, and we are not the only ones who do
> the cyclecounter -> timecounter correction like this.
> -Chris
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
                   ` (4 preceding siblings ...)
  2017-11-15 12:30 ` ✗ Fi.CI.BAT: warning for GPU/CPU timestamps correlation for relating OA samples with system events Patchwork
@ 2017-12-05 14:16 ` Lionel Landwerlin
  2017-12-05 14:28   ` Robert Bragg
  2017-12-06 20:02 ` Lionel Landwerlin
  2017-12-07  0:48 ` Robert Bragg
  7 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-05 14:16 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx

Hey Sagar,

Sorry for the delay looking into this series.
I've done some userspace/UI work in GPUTop to try to correlate perf 
samples/tracepoints with i915 perf reports.

I wanted to avoid having to add too much logic into the kernel and tried
to sample both the CPU clock and the GPU timestamp from userspace.
So far that's not working. People more knowledgeable than I would have
realized that the kernel can sneak work into syscalls.
The result is that two back-to-back syscalls from the same thread (one to get
the CPU clock, one for the GPU timestamp) lead to time differences of anywhere
from a few microseconds to, in some cases, close to 1 millisecond. So it's
basically unworkable.
Anyway the UI work won't go to waste :)
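
For reference, the experiment was essentially the following (read_gpu_timestamp is a
hypothetical helper, e.g. built on the reg_read ioctl; error handling omitted):

#include <stdint.h>
#include <time.h>

uint64_t read_gpu_timestamp(int drm_fd);	/* hypothetical helper */

static uint64_t ts_to_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/* Returns the width of the CPU-time bracket around the GPU read. */
uint64_t sample_correlation(int drm_fd, uint64_t *gpu_ts, uint64_t *cpu_ns)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	*gpu_ts = read_gpu_timestamp(drm_fd);	/* kernel may sneak in work here */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*cpu_ns = (ts_to_ns(&t0) + ts_to_ns(&t1)) / 2;

	/* observed anywhere from a few microseconds up to ~1ms */
	return ts_to_ns(&t1) - ts_to_ns(&t0);
}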

I'm thinking of going with your approach.
From my experiment with GPUTop, it seems we might want to use a
different CPU clock source though, or make it configurable.
The perf infrastructure allows you to choose which clock you want to use.
Since we want to avoid time adjustments on that clock (because we're
adding deltas), CLOCK_MONOTONIC_RAW would make the most sense.
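
For what it's worth, selecting the clock on the perf side is just a matter of setting
the clockid in the event attributes, roughly like this (sketch only):

#include <linux/perf_event.h>
#include <string.h>
#include <time.h>

static void setup_sample_clock(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sample_type = PERF_SAMPLE_TIME;
	attr->use_clockid = 1;
	attr->clockid = CLOCK_MONOTONIC_RAW;	/* or CLOCK_MONOTONIC, see below */
}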

I'll look at adding some tests for this too.

Thanks,

-
Lionel

On 15/11/17 12:13, Sagar Arun Kamble wrote:
> We can compute system time corresponding to GPU timestamp by taking a
> reference point (CPU monotonic time, GPU timestamp) and then adding
> delta time computed using timecounter/cyclecounter support in kernel.
> We have to configure cyclecounter with the GPU timestamp frequency.
> Earlier approach that was based on cross-timestamp is not needed. It
> was being used to approximate the frequency based on invalid assumptions
> (possibly drift was being seen in the time due to precision issue).
> The precision of time from GPU clocks is already in ns and timecounter
> takes care of it as verified over variable durations.
>
> This series adds base timecounter/cyclecounter changes and changes to
> get GPU and CPU timestamps in OA samples.
>
> Sagar Arun Kamble (1):
>    drm/i915/perf: Add support to correlate GPU timestamp with system time
>
> Sourab Gupta (3):
>    drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>      reports
>    drm/i915/perf: Extract raw GPU timestamps from OA reports
>    drm/i915/perf: Send system clock monotonic time in perf samples
>
>   drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>   drivers/gpu/drm/i915/i915_perf.c | 124 ++++++++++++++++++++++++++++++++++++++-
>   drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>   include/uapi/drm/i915_drm.h      |  14 +++++
>   4 files changed, 154 insertions(+), 1 deletion(-)
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
  2017-11-15 12:31   ` Chris Wilson
  2017-11-15 17:54   ` Sagar Arun Kamble
@ 2017-12-05 14:22   ` Lionel Landwerlin
  2017-12-06  8:31     ` Sagar Arun Kamble
  2 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-05 14:22 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

On 15/11/17 12:13, Sagar Arun Kamble wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> Currently, we have the ability to only forward the GPU timestamps in the
> samples (which are generated via OA reports). This limits the ability to
> correlate these samples with the system events.
>
> An ability is therefore needed to report timestamps in different clock
> domains, such as CLOCK_MONOTONIC, in the perf samples to be of more
> practical use to the userspace. This ability becomes important
> when we want to correlate/plot GPU events/samples with other system events
> on the same timeline (e.g. vblank events, or timestamps when work was
> submitted to kernel, etc.)
>
> The patch here proposes a mechanism to achieve this. The correlation
> between gpu time and system time is established using the timestamp clock
> associated with the command stream, abstracted as timecounter/cyclecounter
> to retrieve gpu/system time correlated values.
>
> v2: Added i915_driver_init_late() function to capture the new late init
> phase for perf (Chris)
>
> v3: Removed cross-timestamp changes.
>
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sourab Gupta <sourab.gupta@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
>   include/uapi/drm/i915_drm.h      |  7 +++++++
>   2 files changed, 34 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index 3b721d7..94ee924 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -336,6 +336,7 @@
>   
>   #define SAMPLE_OA_REPORT	BIT(0)
>   #define SAMPLE_GPU_TS		BIT(1)
> +#define SAMPLE_SYSTEM_TS	BIT(2)
>   
>   /**
>    * struct perf_open_properties - for validated properties given to open a stream
> @@ -622,6 +623,7 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>   	struct drm_i915_perf_record_header header;
>   	u32 sample_flags = stream->sample_flags;
>   	u64 gpu_ts = 0;
> +	u64 system_ts = 0;
>   
>   	header.type = DRM_I915_PERF_RECORD_SAMPLE;
>   	header.pad = 0;
> @@ -647,6 +649,23 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>   
>   		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>   			return -EFAULT;
> +		buf += I915_PERF_TS_SAMPLE_SIZE;

This is a ridiculous nit, but I think using sizeof(u64) would be more 
readable than this I915_PERF_TS_SAMPLE_SIZE define.

> +	}
> +
> +	if (sample_flags & SAMPLE_SYSTEM_TS) {
> +		gpu_ts = get_gpu_ts_from_oa_report(stream, report);
> +		/*
> +		 * XXX: timecounter_cyc2time considers time backwards if delta
> +		 * timestamp is more than half the max ns time covered by
> +		 * counter. It will be ~35min for 36 bit counter. If this much
> +		 * sampling duration is needed we will have to update tc->nsec
> +		 * by explicitly reading the timecounter (timecounter_read)
> +		 * before this duration.
> +		 */
> +		system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
> +
> +		if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
> +			return -EFAULT;
>   	}
>   
>   	(*offset) += header.size;
> @@ -2137,6 +2156,11 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream,
>   		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
>   	}
>   
> +	if (props->sample_flags & SAMPLE_SYSTEM_TS) {
> +		stream->sample_flags |= SAMPLE_SYSTEM_TS;
> +		stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
> +	}
> +
>   	dev_priv->perf.oa.oa_buffer.format_size = format_size;
>   	if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
>   		return -EINVAL;
> @@ -2857,6 +2881,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
>   		case DRM_I915_PERF_PROP_SAMPLE_GPU_TS:
>   			props->sample_flags |= SAMPLE_GPU_TS;
>   			break;
> +		case DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS:
> +			props->sample_flags |= SAMPLE_SYSTEM_TS;
> +			break;
>   		case DRM_I915_PERF_PROP_OA_METRICS_SET:
>   			if (value == 0) {
>   				DRM_DEBUG("Unknown OA metric set ID\n");
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 0b9249e..283859c 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1453,6 +1453,12 @@ enum drm_i915_perf_property_id {
>   	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
>   
>   	/**
> +	 * This property requests inclusion of CLOCK_MONOTONIC system time in
> +	 * the perf sample data.
> +	 */
> +	DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS,
> +
> +	/**
>   	 * The value specifies which set of OA unit metrics should be
>   	 * be configured, defining the contents of any OA unit reports.
>   	 */
> @@ -1539,6 +1545,7 @@ enum drm_i915_perf_record_type {
>   	 *
>   	 *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
>   	 *     { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
> +	 *     { u64 system_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS

I would just add those u64 fields before oa_report.
Since the report sizes are dictated by the hardware, I'm afraid that one
day someone might come up with a non-64-bit-aligned format (however
unlikely).
And since the new properties mean you need to be aware of the potential 
new offsets, it's not breaking existing userspace.

>   	 * };
>   	 */
>   	DRM_I915_PERF_RECORD_SAMPLE = 1,


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-05 14:16 ` [RFC 0/4] " Lionel Landwerlin
@ 2017-12-05 14:28   ` Robert Bragg
  2017-12-05 14:37     ` Lionel Landwerlin
  0 siblings, 1 reply; 45+ messages in thread
From: Robert Bragg @ 2017-12-05 14:28 UTC (permalink / raw)
  To: Lionel Landwerlin; +Cc: Intel Graphics Development



On Tue, Dec 5, 2017 at 2:16 PM, Lionel Landwerlin <
lionel.g.landwerlin@intel.com> wrote:

> Hey Sagar,
>
> Sorry for the delay looking into this series.
> I've done some userspace/UI work in GPUTop to try to correlate perf
> samples/tracepoints with i915 perf reports.
>
> I wanted to avoid having to add too much logic into the kernel and tried
> to sample both cpu clocks & gpu timestamps from userspace.
> So far that's not working. People more knowledgable than I would have
> realized that the kernel can sneak in work into syscalls.
> So result is that 2 syscalls (one to get the cpu clock, one for the gpu
> timestamp) back to back from the same thread leads to time differences of
> anywhere from a few microseconds to in some cases close to 1millisecond. So
> it's basically unworkable.
> Anyway the UI work won't go to waste :)
>
> I'm thinking to go with your approach.
> From my experiment with gputop, it seems we might want to use a different
> cpu clock source though or make it configurable.
> The perf infrastructure allows you to choose what clock you want to use.
> Since we want to avoid time adjustments on that clock (because we're adding
> deltas), a clock monotonic raw would make most sense.
>

I would guess the most generally useful clock domain to correlate with the
largest number of interesting events would surely be CLOCK_MONOTONIC, not
_MONOTONIC_RAW.

E.g. here's some discussion around why vblank events use CLOCK_MONOTONIC:
https://lists.freedesktop.org/archives/dri-devel/2012-October/028878.html

Br,
- Robert


> I'll look at adding some tests for this too.
>
> Thanks,
>
> -
> Lionel
>
> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>
>> We can compute system time corresponding to GPU timestamp by taking a
>> reference point (CPU monotonic time, GPU timestamp) and then adding
>> delta time computed using timecounter/cyclecounter support in kernel.
>> We have to configure cyclecounter with the GPU timestamp frequency.
>> Earlier approach that was based on cross-timestamp is not needed. It
>> was being used to approximate the frequency based on invalid assumptions
>> (possibly drift was being seen in the time due to precision issue).
>> The precision of time from GPU clocks is already in ns and timecounter
>> takes care of it as verified over variable durations.
>>
>> This series adds base timecounter/cyclecounter changes and changes to
>> get GPU and CPU timestamps in OA samples.
>>
>> Sagar Arun Kamble (1):
>>    drm/i915/perf: Add support to correlate GPU timestamp with system time
>>
>> Sourab Gupta (3):
>>    drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>>      reports
>>    drm/i915/perf: Extract raw GPU timestamps from OA reports
>>    drm/i915/perf: Send system clock monotonic time in perf samples
>>
>>   drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>>   drivers/gpu/drm/i915/i915_perf.c | 124 ++++++++++++++++++++++++++++++
>> ++++++++-
>>   drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>>   include/uapi/drm/i915_drm.h      |  14 +++++
>>   4 files changed, 154 insertions(+), 1 deletion(-)
>>
>>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-05 14:28   ` Robert Bragg
@ 2017-12-05 14:37     ` Lionel Landwerlin
  2017-12-06  9:01       ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-05 14:37 UTC (permalink / raw)
  To: Robert Bragg; +Cc: Intel Graphics Development



On 05/12/17 14:28, Robert Bragg wrote:
>
>
> On Tue, Dec 5, 2017 at 2:16 PM, Lionel Landwerlin 
> <lionel.g.landwerlin@intel.com <mailto:lionel.g.landwerlin@intel.com>> 
> wrote:
>
>     Hey Sagar,
>
>     Sorry for the delay looking into this series.
>     I've done some userspace/UI work in GPUTop to try to correlate
>     perf samples/tracepoints with i915 perf reports.
>
>     I wanted to avoid having to add too much logic into the kernel and
>     tried to sample both cpu clocks & gpu timestamps from userspace.
>     So far that's not working. People more knowledgable than I would
>     have realized that the kernel can sneak in work into syscalls.
>     So result is that 2 syscalls (one to get the cpu clock, one for
>     the gpu timestamp) back to back from the same thread leads to time
>     differences of anywhere from a few microseconds to in some cases
>     close to 1millisecond. So it's basically unworkable.
>     Anyway the UI work won't go to waste :)
>
>     I'm thinking to go with your approach.
>     >From my experiment with gputop, it seems we might want to use a
>     different cpu clock source though or make it configurable.
>     The perf infrastructure allows you to choose what clock you want
>     to use. Since we want to avoid time adjustments on that clock
>     (because we're adding deltas), a clock monotonic raw would make
>     most sense.
>
>
> I would guess the most generally useful clock domain to correlate with 
> the largest number of interesting events would surely be 
> CLOCK_MONOTONIC, not _MONOTONIC_RAW.
>
> E.g. here's some discussion around why vblank events use 
> CLOCK_MONOTINIC: 
> https://lists.freedesktop.org/archives/dri-devel/2012-October/028878.html
>
> Br,
> - Robert

Thanks Rob. Then I guess making it configurable when opening the stream
would be the safest option.

>
>
>     I'll look at adding some tests for this too.
>
>     Thanks,
>
>     -
>     Lionel
>
>     On 15/11/17 12:13, Sagar Arun Kamble wrote:
>
>         We can compute system time corresponding to GPU timestamp by
>         taking a
>         reference point (CPU monotonic time, GPU timestamp) and then
>         adding
>         delta time computed using timecounter/cyclecounter support in
>         kernel.
>         We have to configure cyclecounter with the GPU timestamp
>         frequency.
>         Earlier approach that was based on cross-timestamp is not
>         needed. It
>         was being used to approximate the frequency based on invalid
>         assumptions
>         (possibly drift was being seen in the time due to precision
>         issue).
>         The precision of time from GPU clocks is already in ns and
>         timecounter
>         takes care of it as verified over variable durations.
>
>         This series adds base timecounter/cyclecounter changes and
>         changes to
>         get GPU and CPU timestamps in OA samples.
>
>         Sagar Arun Kamble (1):
>            drm/i915/perf: Add support to correlate GPU timestamp with
>         system time
>
>         Sourab Gupta (3):
>            drm/i915/perf: Add support for collecting 64 bit timestamps
>         with OA
>              reports
>            drm/i915/perf: Extract raw GPU timestamps from OA reports
>            drm/i915/perf: Send system clock monotonic time in perf samples
>
>           drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>           drivers/gpu/drm/i915/i915_perf.c | 124
>         ++++++++++++++++++++++++++++++++++++++-
>           drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>           include/uapi/drm/i915_drm.h      |  14 +++++
>           4 files changed, 154 insertions(+), 1 deletion(-)
>
>
>     _______________________________________________
>     Intel-gfx mailing list
>     Intel-gfx@lists.freedesktop.org
>     <mailto:Intel-gfx@lists.freedesktop.org>
>     https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>     <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>
>
>


[-- Attachment #1.2: Type: text/html, Size: 7437 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time
  2017-12-05 13:58     ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Lionel Landwerlin
@ 2017-12-06  8:17       ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-06  8:17 UTC (permalink / raw)
  To: Lionel Landwerlin, Chris Wilson, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 12/5/2017 7:28 PM, Lionel Landwerlin wrote:
> On 15/11/17 12:25, Chris Wilson wrote:
>> Quoting Sagar Arun Kamble (2017-11-15 12:13:51)
>>>   #include <drm/drmP.h>
>>>   #include <drm/intel-gtt.h>
>>> @@ -2149,6 +2150,14 @@ struct i915_perf_stream {
>>>           * @oa_config: The OA configuration used by the stream.
>>>           */
>>>          struct i915_oa_config *oa_config;
>>> +
>>> +       /**
>>> +        * System time correlation variables.
>>> +        */
>>> +       struct cyclecounter cc;
>>> +       spinlock_t systime_lock;
>>> +       struct timespec64 start_systime;
>>> +       struct timecounter tc;
>> This pattern is repeated a lot by struct timecounter users. (I'm still
>> trying to understand why the common case is not catered for by a
>> convenience timecounter api.)
>>
>>>   };
>>>     /**
>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>> b/drivers/gpu/drm/i915/i915_perf.c
>>> index 00be015..72ddc34 100644
>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>> @@ -192,6 +192,7 @@
>>>    */
>>>     #include <linux/anon_inodes.h>
>>> +#include <linux/clocksource.h>
>>>   #include <linux/sizes.h>
>>>   #include <linux/uuid.h>
>>>   @@ -2391,6 +2392,56 @@ static unsigned int i915_perf_poll(struct 
>>> file *file, poll_table *wait)
>>>   }
>>>     /**
>>> + * i915_cyclecounter_read - read raw cycle/timestamp counter
>>> + * @cc: cyclecounter structure
>>> + */
>>> +static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
>>> +{
>>> +       struct i915_perf_stream *stream = container_of(cc, 
>>> typeof(*stream), cc);
>>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>>> +       u64 ts_count;
>>> +
>>> +       intel_runtime_pm_get(dev_priv);
>>> +       ts_count = I915_READ64_2x32(GEN4_TIMESTAMP,
>>> +                                   GEN7_TIMESTAMP_UDW);
>>> +       intel_runtime_pm_put(dev_priv);
>>> +
>>> +       return ts_count;
>>> +}
>>> +
>>> +static void i915_perf_init_cyclecounter(struct i915_perf_stream 
>>> *stream)
>>> +{
>>> +       struct drm_i915_private *dev_priv = stream->dev_priv;
>>> +       int cs_ts_freq = dev_priv->perf.oa.timestamp_frequency;
>>> +       struct cyclecounter *cc = &stream->cc;
>>> +       u32 maxsec;
>>> +
>>> +       cc->read = i915_cyclecounter_read;
>>> +       cc->mask = CYCLECOUNTER_MASK(CS_TIMESTAMP_WIDTH(dev_priv));
>>> +       maxsec = cc->mask / cs_ts_freq;
>>> +
>>> +       clocks_calc_mult_shift(&cc->mult, &cc->shift, cs_ts_freq,
>>> +                              NSEC_PER_SEC, maxsec);
>>> +}
>>> +
>>> +static void i915_perf_init_timecounter(struct i915_perf_stream 
>>> *stream)
>>> +{
>>> +#define SYSTIME_START_OFFSET   350000 /* Counter read takes about 
>>> 350us */
>>> +       unsigned long flags;
>>> +       u64 ns;
>>> +
>>> +       i915_perf_init_cyclecounter(stream);
>>> +       spin_lock_init(&stream->systime_lock);
>>> +
>>> +       getnstimeofday64(&stream->start_systime);
>>> +       ns = timespec64_to_ns(&stream->start_systime) + 
>>> SYSTIME_START_OFFSET;
>> Use ktime directly. Or else Arnd will be back with a patch to fix it.
>> (All non-ktime interfaces are effectively deprecated; obsolete for
>> drivers.)
>>
>>> + spin_lock_irqsave(&stream->systime_lock, flags);
>>> +       timecounter_init(&stream->tc, &stream->cc, ns);
>>> +       spin_unlock_irqrestore(&stream->systime_lock, flags);
>>> +}
>>> +
>>> +/**
>>>    * i915_perf_enable_locked - handle `I915_PERF_IOCTL_ENABLE` ioctl
>>>    * @stream: A disabled i915 perf stream
>>>    *
>>> @@ -2408,6 +2459,8 @@ static void i915_perf_enable_locked(struct 
>>> i915_perf_stream *stream)
>>>          /* Allow stream->ops->enable() to refer to this */
>>>          stream->enabled = true;
>>>   +       i915_perf_init_timecounter(stream);
>>> +
>>>          if (stream->ops->enable)
>>>                  stream->ops->enable(stream);
>>>   }
>>> diff --git a/drivers/gpu/drm/i915/i915_reg.h 
>>> b/drivers/gpu/drm/i915/i915_reg.h
>>> index cfdf4f8..e7e6966 100644
>>> --- a/drivers/gpu/drm/i915/i915_reg.h
>>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>>> @@ -8882,6 +8882,12 @@ enum skl_power_gate {
>>>     /* Gen4+ Timestamp and Pipe Frame time stamp registers */
>>>   #define GEN4_TIMESTAMP         _MMIO(0x2358)
>>> +#define GEN7_TIMESTAMP_UDW     _MMIO(0x235C)
>>> +#define PRE_GEN7_TIMESTAMP_WIDTH       32
>>> +#define GEN7_TIMESTAMP_WIDTH           36
>>> +#define CS_TIMESTAMP_WIDTH(dev_priv) \
>>> +       (INTEL_GEN(dev_priv) < 7 ? PRE_GEN7_TIMESTAMP_WIDTH : \
>>> +                                  GEN7_TIMESTAMP_WIDTH)
>> s/PRE_GEN7/GEN4/ would be consistent.
>> If you really want to add support for earlier, I9XX_.
>
> Why not use RING_TIMESTAMP() / RING_TIMESTAMP_UDW()?
> They're already used for the reg_read ioctl.

Yes. We can use these.
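Something along these lines perhaps (untested sketch; assumes the render ring
mmio base is the right one for the CS timestamp that the OA unit samples):

static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
{
	struct i915_perf_stream *stream = container_of(cc, typeof(*stream), cc);
	struct drm_i915_private *dev_priv = stream->dev_priv;
	u64 ts_count;

	intel_runtime_pm_get(dev_priv);
	/* same registers as GEN4_TIMESTAMP / GEN7_TIMESTAMP_UDW, but via the
	 * RING_TIMESTAMP helpers already used by the reg_read ioctl */
	ts_count = I915_READ64_2x32(RING_TIMESTAMP(RENDER_RING_BASE),
				    RING_TIMESTAMP_UDW(RENDER_RING_BASE));
	intel_runtime_pm_put(dev_priv);

	return ts_count;
}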

>
>>
>> Ok. I can accept the justification, and we are not the only ones who do
>> the cyclecounter -> timecounter correction like this.
>> -Chris
>>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples
  2017-12-05 14:22   ` Lionel Landwerlin
@ 2017-12-06  8:31     ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-06  8:31 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 12/5/2017 7:52 PM, Lionel Landwerlin wrote:
> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>> From: Sourab Gupta <sourab.gupta@intel.com>
>>
>> Currently, we have the ability to only forward the GPU timestamps in the
>> samples (which are generated via OA reports). This limits the ability to
>> correlate these samples with the system events.
>>
>> An ability is therefore needed to report timestamps in different clock
>> domains, such as CLOCK_MONOTONIC, in the perf samples to be of more
>> practical use to the userspace. This ability becomes important
>> when we want to correlate/plot GPU events/samples with other system 
>> events
>> on the same timeline (e.g. vblank events, or timestamps when work was
>> submitted to kernel, etc.)
>>
>> The patch here proposes a mechanism to achieve this. The correlation
>> between gpu time and system time is established using the timestamp 
>> clock
>> associated with the command stream, abstracted as 
>> timecounter/cyclecounter
>> to retrieve gpu/system time correlated values.
>>
>> v2: Added i915_driver_init_late() function to capture the new late init
>> phase for perf (Chris)
>>
>> v3: Removed cross-timestamp changes.
>>
>> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
>> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Sourab Gupta <sourab.gupta@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_perf.c | 27 +++++++++++++++++++++++++++
>>   include/uapi/drm/i915_drm.h      |  7 +++++++
>>   2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>> b/drivers/gpu/drm/i915/i915_perf.c
>> index 3b721d7..94ee924 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -336,6 +336,7 @@
>>     #define SAMPLE_OA_REPORT    BIT(0)
>>   #define SAMPLE_GPU_TS        BIT(1)
>> +#define SAMPLE_SYSTEM_TS    BIT(2)
>>     /**
>>    * struct perf_open_properties - for validated properties given to 
>> open a stream
>> @@ -622,6 +623,7 @@ static int append_oa_sample(struct 
>> i915_perf_stream *stream,
>>       struct drm_i915_perf_record_header header;
>>       u32 sample_flags = stream->sample_flags;
>>       u64 gpu_ts = 0;
>> +    u64 system_ts = 0;
>>         header.type = DRM_I915_PERF_RECORD_SAMPLE;
>>       header.pad = 0;
>> @@ -647,6 +649,23 @@ static int append_oa_sample(struct 
>> i915_perf_stream *stream,
>>             if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>>               return -EFAULT;
>> +        buf += I915_PERF_TS_SAMPLE_SIZE;
>
> This is a ridiculous nit, but I think using sizeof(u64) would be more 
> readable than this I915_PERF_TS_SAMPLE_SIZE define.

Will update. Thanks.

>
>> +    }
>> +
>> +    if (sample_flags & SAMPLE_SYSTEM_TS) {
>> +        gpu_ts = get_gpu_ts_from_oa_report(stream, report);
>> +        /*
>> +         * XXX: timecounter_cyc2time considers time backwards if delta
>> +         * timestamp is more than half the max ns time covered by
>> +         * counter. It will be ~35min for 36 bit counter. If this much
>> +         * sampling duration is needed we will have to update tc->nsec
>> +         * by explicitly reading the timecounter (timecounter_read)
>> +         * before this duration.
>> +         */
>> +        system_ts = timecounter_cyc2time(&stream->tc, gpu_ts);
>> +
>> +        if (copy_to_user(buf, &system_ts, I915_PERF_TS_SAMPLE_SIZE))
>> +            return -EFAULT;
>>       }
>>         (*offset) += header.size;
>> @@ -2137,6 +2156,11 @@ static int i915_oa_stream_init(struct 
>> i915_perf_stream *stream,
>>           stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
>>       }
>>   +    if (props->sample_flags & SAMPLE_SYSTEM_TS) {
>> +        stream->sample_flags |= SAMPLE_SYSTEM_TS;
>> +        stream->sample_size += I915_PERF_TS_SAMPLE_SIZE;
>> +    }
>> +
>>       dev_priv->perf.oa.oa_buffer.format_size = format_size;
>>       if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0))
>>           return -EINVAL;
>> @@ -2857,6 +2881,9 @@ static int read_properties_unlocked(struct 
>> drm_i915_private *dev_priv,
>>           case DRM_I915_PERF_PROP_SAMPLE_GPU_TS:
>>               props->sample_flags |= SAMPLE_GPU_TS;
>>               break;
>> +        case DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS:
>> +            props->sample_flags |= SAMPLE_SYSTEM_TS;
>> +            break;
>>           case DRM_I915_PERF_PROP_OA_METRICS_SET:
>>               if (value == 0) {
>>                   DRM_DEBUG("Unknown OA metric set ID\n");
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 0b9249e..283859c 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1453,6 +1453,12 @@ enum drm_i915_perf_property_id {
>>       DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
>>         /**
>> +     * This property requests inclusion of CLOCK_MONOTONIC system 
>> time in
>> +     * the perf sample data.
>> +     */
>> +    DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS,
>> +
>> +    /**
>>        * The value specifies which set of OA unit metrics should be
>>        * be configured, defining the contents of any OA unit reports.
>>        */
>> @@ -1539,6 +1545,7 @@ enum drm_i915_perf_record_type {
>>        *
>>        *     { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
>>        *     { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
>> +     *     { u64 system_timestamp; } && 
>> DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS
>
> I would just add those u64 fields before oa_report.
> Since the report sizes are dictated by the hardware, I'm afraid that 
> one day someone might come up with a non 64bit aligned format (however 
> unlikely).
> And since the new properties mean you need to be aware of the 
> potential new offsets, it's not breaking existing userspace.

Sure. Will update.
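For reference, the resulting sample layout would then read something like this
(sketch only; exact ordering to be confirmed in the next revision):

 *     struct {
 *         struct drm_i915_perf_record_header header;
 *         { u64 system_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS
 *         { u64 gpu_timestamp; } && DRM_I915_PERF_PROP_SAMPLE_GPU_TS
 *         { u32 oa_report[]; } && DRM_I915_PERF_PROP_SAMPLE_OA
 *     };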

>
>>        * };
>>        */
>>       DRM_I915_PERF_RECORD_SAMPLE = 1,
>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-05 14:37     ` Lionel Landwerlin
@ 2017-12-06  9:01       ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-06  9:01 UTC (permalink / raw)
  To: Lionel Landwerlin, Robert Bragg; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 4441 bytes --]



On 12/5/2017 8:07 PM, Lionel Landwerlin wrote:
> On 05/12/17 14:28, Robert Bragg wrote:
>>
>>
>> On Tue, Dec 5, 2017 at 2:16 PM, Lionel Landwerlin 
>> <lionel.g.landwerlin@intel.com 
>> <mailto:lionel.g.landwerlin@intel.com>> wrote:
>>
>>     Hey Sagar,
>>
>>     Sorry for the delay looking into this series.
>>     I've done some userspace/UI work in GPUTop to try to correlate
>>     perf samples/tracepoints with i915 perf reports.
>>
>>     I wanted to avoid having to add too much logic into the kernel
>>     and tried to sample both cpu clocks & gpu timestamps from userspace.
>>     So far that's not working. People more knowledgable than I would
>>     have realized that the kernel can sneak in work into syscalls.
>>     So result is that 2 syscalls (one to get the cpu clock, one for
>>     the gpu timestamp) back to back from the same thread leads to
>>     time differences of anywhere from a few microseconds to in some
>>     cases close to 1 millisecond. So it's basically unworkable.
>>     Anyway the UI work won't go to waste :)
>>
>>     I'm thinking to go with your approach.
>>     From my experiment with gputop, it seems we might want to use a
>>     different cpu clock source though or make it configurable.
>>     The perf infrastructure allows you to choose what clock you want
>>     to use. Since we want to avoid time adjustments on that clock
>>     (because we're adding deltas), a clock monotonic raw would make
>>     most sense.
>>
>>
>> I would guess the most generally useful clock domain to correlate 
>> with the largest number of interesting events would surely be 
>> CLOCK_MONOTONIC, not _MONOTONIC_RAW.
>>
>> E.g. here's some discussion around why vblank events use 
>> CLOCK_MONOTONIC: 
>> https://lists.freedesktop.org/archives/dri-devel/2012-October/028878.html
>>
>> Br,
>> - Robert
>
> Thanks Rob, then I guess making it configurable when opening the 
> stream would be the safest option.
>
All timecounter users today rely on the monotonic clock (real/wall clock time, boot time, or TAI time corresponding
to the monotonic clock can then be requested).
Agreed that we will need a configuration option for the clock to be used, like trace_clock in ftrace.
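Something along these lines could work for the uapi (purely a hypothetical
sketch; names and the exact set of clocks are not final):

/* hypothetical sketch only - not a committed uapi */
enum drm_i915_perf_sample_clock_id {
	I915_PERF_SAMPLE_CLOCK_MONOTONIC = 1,
	I915_PERF_SAMPLE_CLOCK_MONOTONIC_RAW,
	I915_PERF_SAMPLE_CLOCK_REALTIME,
	I915_PERF_SAMPLE_CLOCK_BOOTTIME,
};

/*
 * A new DRM_I915_PERF_PROP_SAMPLE_CLOCK_ID property (appended at the end of
 * enum drm_i915_perf_property_id) would then select which clock the
 * correlated system timestamps are reported in, defaulting to CLOCK_MONOTONIC.
 */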

>>
>>
>>     I'll look at adding some tests for this too.
>>
>>     Thanks,
>>
>>     -
>>     Lionel
>>
>>     On 15/11/17 12:13, Sagar Arun Kamble wrote:
>>
>>         We can compute system time corresponding to GPU timestamp by
>>         taking a
>>         reference point (CPU monotonic time, GPU timestamp) and then
>>         adding
>>         delta time computed using timecounter/cyclecounter support in
>>         kernel.
>>         We have to configure cyclecounter with the GPU timestamp
>>         frequency.
>>         Earlier approach that was based on cross-timestamp is not
>>         needed. It
>>         was being used to approximate the frequency based on invalid
>>         assumptions
>>         (possibly drift was being seen in the time due to precision
>>         issue).
>>         The precision of time from GPU clocks is already in ns and
>>         timecounter
>>         takes care of it as verified over variable durations.
>>
>>         This series adds base timecounter/cyclecounter changes and
>>         changes to
>>         get GPU and CPU timestamps in OA samples.
>>
>>         Sagar Arun Kamble (1):
>>            drm/i915/perf: Add support to correlate GPU timestamp with
>>         system time
>>
>>         Sourab Gupta (3):
>>            drm/i915/perf: Add support for collecting 64 bit
>>         timestamps with OA
>>              reports
>>            drm/i915/perf: Extract raw GPU timestamps from OA reports
>>            drm/i915/perf: Send system clock monotonic time in perf
>>         samples
>>
>>           drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>>           drivers/gpu/drm/i915/i915_perf.c | 124
>>         ++++++++++++++++++++++++++++++++++++++-
>>           drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>>           include/uapi/drm/i915_drm.h      |  14 +++++
>>           4 files changed, 154 insertions(+), 1 deletion(-)
>>
>>
>>     _______________________________________________
>>     Intel-gfx mailing list
>>     Intel-gfx@lists.freedesktop.org
>>     <mailto:Intel-gfx@lists.freedesktop.org>
>>     https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>     <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>
>>
>>
>


[-- Attachment #1.2: Type: text/html, Size: 8404 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports
  2017-11-15 12:13 ` [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports Sagar Arun Kamble
@ 2017-12-06 16:01   ` Lionel Landwerlin
  2017-12-21  8:38     ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-06 16:01 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

On 15/11/17 12:13, Sagar Arun Kamble wrote:
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1447,6 +1447,12 @@ enum drm_i915_perf_property_id {
>   	DRM_I915_PERF_PROP_SAMPLE_OA,
>   
>   	/**
> +	 * The value of this property set to 1 requests inclusion of GPU
> +	 * timestamp in the perf sample data.
> +	 */
> +	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
> +
> +	/**
>   	 * The value specifies which set of OA unit metrics should be
>   	 * be configured, defining the contents of any OA unit reports.
>   	 */


Inserting this anywhere other than at the end of this enum breaks the API/ABI.
This applies to the other patches too.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from OA reports
  2017-11-15 12:13 ` [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from " Sagar Arun Kamble
@ 2017-12-06 19:55   ` Lionel Landwerlin
  2017-12-21  8:50     ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-06 19:55 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx; +Cc: Sourab Gupta, Matthew Auld

On 15/11/17 12:13, Sagar Arun Kamble wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> The OA reports contain the least significant 32 bits of the gpu timestamp.
> This patch enables retrieval of the timestamp field from OA reports, to
> forward as 64 bit raw gpu timestamps in the perf samples.
>
> v2: Rebase w.r.t new timecounter support.
>
> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sourab Gupta <sourab.gupta@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h  |  2 ++
>   drivers/gpu/drm/i915/i915_perf.c | 26 +++++++++++++++++++++++++-
>   2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e08bc85..5534cd2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2151,6 +2151,8 @@ struct i915_perf_stream {
>   	 */
>   	struct i915_oa_config *oa_config;
>   
> +	u64 last_gpu_ts;
> +
>   	/**
>   	 * System time correlation variables.
>   	 */
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index f7e748c..3b721d7 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -575,6 +575,26 @@ static int append_oa_status(struct i915_perf_stream *stream,
>   }
>   
>   /**
> + * get_gpu_ts_from_oa_report - Retrieve absolute gpu timestamp from OA report
> + *
> + * Note: We are assuming that we're updating last_gpu_ts frequently enough so
> + * that it's never possible to see multiple overflows before we compare
> + * sample_ts to last_gpu_ts. Since this is significantly large duration
> + * (~6min for 80ns ts base), we can safely assume so.
> + */
> +static u64 get_gpu_ts_from_oa_report(struct i915_perf_stream *stream,
> +				     const u8 *report)
> +{
> +	u32 sample_ts = *(u32 *)(report + 4);
> +	u32 delta;
> +
> +	delta = sample_ts - (u32)stream->last_gpu_ts;
> +	stream->last_gpu_ts += delta;
> +
> +	return stream->last_gpu_ts;
> +}
> +
> +/**
>    * append_oa_sample - Copies single OA report into userspace read() buffer.
>    * @stream: An i915-perf stream opened for OA metrics
>    * @buf: destination buffer given by userspace
> @@ -622,7 +642,9 @@ static int append_oa_sample(struct i915_perf_stream *stream,
>   	}
>   
>   	if (sample_flags & SAMPLE_GPU_TS) {
> -		/* Timestamp to be populated from OA report */
> +		/* Timestamp populated from OA report */
> +		gpu_ts = get_gpu_ts_from_oa_report(stream, report);
> +
>   		if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>   			return -EFAULT;
>   	}

I think everything above this line should be merged into patch 2.
It's better to have a single functional patch.

> @@ -2421,6 +2443,8 @@ static u64 i915_cyclecounter_read(const struct cyclecounter *cc)
>   				    GEN7_TIMESTAMP_UDW);
>   	intel_runtime_pm_put(dev_priv);
>   
> +	stream->last_gpu_ts = ts_count;

This doesn't look right. You're already adding a delta in 
get_gpu_ts_from_oa_report().
This will produce incorrect timestamps. Since at the moment we won't 
allow opening without PROP_SAMPLE_OA, I would just drop this line.

> +
>   	return ts_count;
>   }
>   


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
                   ` (5 preceding siblings ...)
  2017-12-05 14:16 ` [RFC 0/4] " Lionel Landwerlin
@ 2017-12-06 20:02 ` Lionel Landwerlin
  2017-12-22  5:15   ` Sagar Arun Kamble
  2017-12-07  0:48 ` Robert Bragg
  7 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-06 20:02 UTC (permalink / raw)
  To: Sagar Arun Kamble, intel-gfx

I've put together some trivial IGT tests:
https://github.com/djdeath/intel-gpu-tools/commits/wip/djdeath/cpu-timestamps
With a few changes which I pointed out in the review:
https://github.com/djdeath/linux/commit/d0e4cf4d3f464491b4ffe97d112284d1ce73656d

Put together it seems to work relatively well.
There is still a small drift happening between the 2 timestamps. I've
noticed that over 160ms of OA reports, there is an accumulated difference of
~35us between the GPU timestamps and CPU timestamps.
I may be doing something wrong with the scaling in the tests, or maybe 
there is an issue in the kernel, or both.
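For scale (rough arithmetic, assuming a nominal 12MHz timestamp clock): 35us
over 160ms is a relative error of roughly 2.2e-4 (~220ppm), which would
correspond to a frequency error in the region of 2-3kHz if it were purely
down to the scaling.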

I'll build the GPUTop parts and see if the results make sense.

Thanks!,

-
Lionel

On 15/11/17 12:13, Sagar Arun Kamble wrote:
> We can compute system time corresponding to GPU timestamp by taking a
> reference point (CPU monotonic time, GPU timestamp) and then adding
> delta time computed using timecounter/cyclecounter support in kernel.
> We have to configure cyclecounter with the GPU timestamp frequency.
> Earlier approach that was based on cross-timestamp is not needed. It
> was being used to approximate the frequency based on invalid assumptions
> (possibly drift was being seen in the time due to precision issue).
> The precision of time from GPU clocks is already in ns and timecounter
> takes care of it as verified over variable durations.
>
> This series adds base timecounter/cyclecounter changes and changes to
> get GPU and CPU timestamps in OA samples.
>
> Sagar Arun Kamble (1):
>    drm/i915/perf: Add support to correlate GPU timestamp with system time
>
> Sourab Gupta (3):
>    drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>      reports
>    drm/i915/perf: Extract raw GPU timestamps from OA reports
>    drm/i915/perf: Send system clock monotonic time in perf samples
>
>   drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>   drivers/gpu/drm/i915/i915_perf.c | 124 ++++++++++++++++++++++++++++++++++++++-
>   drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>   include/uapi/drm/i915_drm.h      |  14 +++++
>   4 files changed, 154 insertions(+), 1 deletion(-)
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
                   ` (6 preceding siblings ...)
  2017-12-06 20:02 ` Lionel Landwerlin
@ 2017-12-07  0:48 ` Robert Bragg
  2017-12-07  0:57   ` Robert Bragg
  2017-12-22  6:06   ` Sagar Arun Kamble
  7 siblings, 2 replies; 45+ messages in thread
From: Robert Bragg @ 2017-12-07  0:48 UTC (permalink / raw)
  To: Sagar Arun Kamble; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 4234 bytes --]

On Wed, Nov 15, 2017 at 12:13 PM, Sagar Arun Kamble <
sagar.a.kamble@intel.com> wrote:

> We can compute system time corresponding to GPU timestamp by taking a
> reference point (CPU monotonic time, GPU timestamp) and then adding
> delta time computed using timecounter/cyclecounter support in kernel.
> We have to configure cyclecounter with the GPU timestamp frequency.
> Earlier approach that was based on cross-timestamp is not needed. It
> was being used to approximate the frequency based on invalid assumptions
> (possibly drift was being seen in the time due to precision issue).
> The precision of time from GPU clocks is already in ns and timecounter
> takes care of it as verified over variable durations.
>

Hi Sagar,

I have some doubts about this analysis...

The intent behind Sourab's original approach was to be able to determine
the frequency at runtime empirically because the constants we have aren't
particularly accurate. Without a perfectly stable frequency that's known
very precisely then an interpolated correlation will inevitably drift. I
think the nature of HW implies we can't expect to have either of those.
Then the general idea had been to try and use existing kernel
infrastructure for a problem which isn't unique to GPU clocks.

That's not to say that a more limited, simpler solution based on frequent
re-correlation wouldn't be more than welcome if tracking an accurate
frequency is too awkward for now, but I think some things need to be
considered in that case:

- It would be good to quantify the kind of drift seen in practice to know
how frequently it's necessary to re-synchronize. It sounds like you've done
this ("as verified over variable durations") so I'm curious what kind of
drift you saw. I'd imagine you would see a significant drift over, say, one
second and it might not take much longer for the drift to even become
clearly visible to the user when plotted in a UI. For reference I once
updated the arb_timer_query test in piglit to give some insight into this
drift (
https://lists.freedesktop.org/archives/piglit/2016-September/020673.html)
and at least from what I wrote back then it looks like I was seeing a drift
of a few milliseconds per second on SKL. I vaguely recall it being much
worse given the frequency constants we had for Haswell.

- What guarantees will be promised about monotonicity of correlated system
timestamps? Will it be guaranteed that sequential reports must have
monotonically increasing timestamps? That might be fiddly if the gpu +
system clock are periodically re-correlated, so it might be good to be
clear in documentation that the correlation is best-effort only for the
sake of implementation simplicity. That would still be good for a lot of
UIs I think and there's freedom for the driver to start simple and
potentially improve later by measuring the gpu clock frequency empirically.

Currently only one correlated pair of timestamps is read when enabling the
stream and so a relatively long time is likely to pass before the stream is
disabled (seconds or minutes while a user is running a system profiler). It
seems very likely to me that these clocks are going to drift significantly
without introducing some form of periodic re-synchronization based on some
understanding of the drift that's seen.

Br,
- Robert



> This series adds base timecounter/cyclecounter changes and changes to
> get GPU and CPU timestamps in OA samples.
>
> Sagar Arun Kamble (1):
>   drm/i915/perf: Add support to correlate GPU timestamp with system time
>
> Sourab Gupta (3):
>   drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>     reports
>   drm/i915/perf: Extract raw GPU timestamps from OA reports
>   drm/i915/perf: Send system clock monotonic time in perf samples
>
>  drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>  drivers/gpu/drm/i915/i915_perf.c | 124 ++++++++++++++++++++++++++++++
> ++++++++-
>  drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>  include/uapi/drm/i915_drm.h      |  14 +++++
>  4 files changed, 154 insertions(+), 1 deletion(-)
>
> --
> 1.9.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

[-- Attachment #1.2: Type: text/html, Size: 5468 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-07  0:48 ` Robert Bragg
@ 2017-12-07  0:57   ` Robert Bragg
  2017-12-21 12:59     ` Lionel Landwerlin
  2017-12-22  6:06   ` Sagar Arun Kamble
  1 sibling, 1 reply; 45+ messages in thread
From: Robert Bragg @ 2017-12-07  0:57 UTC (permalink / raw)
  To: Sagar Arun Kamble; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 472 bytes --]

On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg <robert@sixbynine.org> wrote:

>
> at least from what I wrote back then it looks like I was seeing a drift of
> a few milliseconds per second on SKL. I vaguely recall it being much worse
> given the frequency constants we had for Haswell.
>

Sorry I didn't actually re-read my own message properly before referencing
it :) Apparently the 2ms per second drift was for Haswell, so presumably
not quite so bad for SKL.

- Robert

[-- Attachment #1.2: Type: text/html, Size: 1062 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports
  2017-12-06 16:01   ` Lionel Landwerlin
@ 2017-12-21  8:38     ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-21  8:38 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 12/6/2017 9:31 PM, Lionel Landwerlin wrote:
> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -1447,6 +1447,12 @@ enum drm_i915_perf_property_id {
>>       DRM_I915_PERF_PROP_SAMPLE_OA,
>>         /**
>> +     * The value of this property set to 1 requests inclusion of GPU
>> +     * timestamp in the perf sample data.
>> +     */
>> +    DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
>> +
>> +    /**
>>        * The value specifies which set of OA unit metrics should be
>>        * be configured, defining the contents of any OA unit reports.
>>        */
>
>
> Inserting this anywhere other than at the end of this enum breaks the API/ABI.
> This applies to the other patches too.
Thanks. Will update.
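i.e. the new properties would go just before DRM_I915_PERF_PROP_MAX, something
like this (sketch against the current enum layout):

enum drm_i915_perf_property_id {
	DRM_I915_PERF_PROP_CTX_HANDLE = 1,
	DRM_I915_PERF_PROP_SAMPLE_OA,
	DRM_I915_PERF_PROP_OA_METRICS_SET,
	DRM_I915_PERF_PROP_OA_FORMAT,
	DRM_I915_PERF_PROP_OA_EXPONENT,

	/* new values appended so existing ones keep their numbering */
	DRM_I915_PERF_PROP_SAMPLE_GPU_TS,
	DRM_I915_PERF_PROP_SAMPLE_SYSTEM_TS,

	DRM_I915_PERF_PROP_MAX /* non-ABI */
};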

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from OA reports
  2017-12-06 19:55   ` Lionel Landwerlin
@ 2017-12-21  8:50     ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-21  8:50 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx; +Cc: Sourab Gupta, Matthew Auld



On 12/7/2017 1:25 AM, Lionel Landwerlin wrote:
> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>> From: Sourab Gupta <sourab.gupta@intel.com>
>>
>> The OA reports contain the least significant 32 bits of the gpu 
>> timestamp.
>> This patch enables retrieval of the timestamp field from OA reports, to
>> forward as 64 bit raw gpu timestamps in the perf samples.
>>
>> v2: Rebase w.r.t new timecounter support.
>>
>> Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
>> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble@intel.com>
>> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Sourab Gupta <sourab.gupta@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h  |  2 ++
>>   drivers/gpu/drm/i915/i915_perf.c | 26 +++++++++++++++++++++++++-
>>   2 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index e08bc85..5534cd2 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2151,6 +2151,8 @@ struct i915_perf_stream {
>>        */
>>       struct i915_oa_config *oa_config;
>>   +    u64 last_gpu_ts;
>> +
>>       /**
>>        * System time correlation variables.
>>        */
>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>> b/drivers/gpu/drm/i915/i915_perf.c
>> index f7e748c..3b721d7 100644
>> --- a/drivers/gpu/drm/i915/i915_perf.c
>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>> @@ -575,6 +575,26 @@ static int append_oa_status(struct 
>> i915_perf_stream *stream,
>>   }
>>     /**
>> + * get_gpu_ts_from_oa_report - Retrieve absolute gpu timestamp from 
>> OA report
>> + *
>> + * Note: We are assuming that we're updating last_gpu_ts frequently 
>> enough so
>> + * that it's never possible to see multiple overflows before we compare
>> + * sample_ts to last_gpu_ts. Since this is significantly large duration
>> + * (~6min for 80ns ts base), we can safely assume so.
>> + */
>> +static u64 get_gpu_ts_from_oa_report(struct i915_perf_stream *stream,
>> +                     const u8 *report)
>> +{
>> +    u32 sample_ts = *(u32 *)(report + 4);
>> +    u32 delta;
>> +
>> +    delta = sample_ts - (u32)stream->last_gpu_ts;
>> +    stream->last_gpu_ts += delta;
>> +
>> +    return stream->last_gpu_ts;
>> +}
>> +
>> +/**
>>    * append_oa_sample - Copies single OA report into userspace read() 
>> buffer.
>>    * @stream: An i915-perf stream opened for OA metrics
>>    * @buf: destination buffer given by userspace
>> @@ -622,7 +642,9 @@ static int append_oa_sample(struct 
>> i915_perf_stream *stream,
>>       }
>>         if (sample_flags & SAMPLE_GPU_TS) {
>> -        /* Timestamp to be populated from OA report */
>> +        /* Timestamp populated from OA report */
>> +        gpu_ts = get_gpu_ts_from_oa_report(stream, report);
>> +
>>           if (copy_to_user(buf, &gpu_ts, I915_PERF_TS_SAMPLE_SIZE))
>>               return -EFAULT;
>>       }
>
> I think everything above this line should be merged int patch 2.
> It's better to have a single functional patch.
Yes. Will merge.
>
>> @@ -2421,6 +2443,8 @@ static u64 i915_cyclecounter_read(const struct 
>> cyclecounter *cc)
>>                       GEN7_TIMESTAMP_UDW);
>>       intel_runtime_pm_put(dev_priv);
>>   +    stream->last_gpu_ts = ts_count;
>
> This doesn't look right. You're already adding a delta in 
> get_gpu_ts_from_oa_report().
> This will produce incorrect timestamps. Since at the moment we won't 
> allow opening without PROP_SAMPLE_OA, I would just drop this line.
>
Yes, that makes sense. Will remove.
>> +
>>       return ts_count;
>>   }
>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-07  0:57   ` Robert Bragg
@ 2017-12-21 12:59     ` Lionel Landwerlin
  2017-12-22  9:30       ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-21 12:59 UTC (permalink / raw)
  To: Robert Bragg, Sagar Arun Kamble; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 1792 bytes --]

Some more findings I made while playing with this series & GPUTop.
Turns out the 2ms drift per second is due to timecounter. Adding the 
delta this way :

https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607

eliminates the drift. Timelines of perf i915 tracepoints & OA reports
now make a lot more sense.
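Roughly, the idea is to convert the cycle delta with a straight division
instead of going through the timecounter mult/shift scaling, whose rounding
error accumulates over time. An untested sketch (delta_cycles being the GPU
timestamp delta since the reference point, cs_ts_freq the CS timestamp
frequency in Hz, and stream->base_system_ns a hypothetical field holding the
CLOCK_MONOTONIC value captured when the stream was enabled):

	u32 rem_cycles;
	u64 secs, delta_ns;

	/* split the delta so the multiplication cannot overflow 64 bits */
	secs = div_u64_rem(delta_cycles, cs_ts_freq, &rem_cycles);
	delta_ns = secs * NSEC_PER_SEC +
		   div_u64((u64)rem_cycles * NSEC_PER_SEC, cs_ts_freq);

	sample_system_ts = stream->base_system_ns + delta_ns;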

There is still the issue that reading the CPU clock & the RCS timestamp 
is inherently not atomic. So there is a delta there.
I think we should add a new i915 perf record type to express the delta 
that we measure this way :

https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475

So that userspace knows there might be a global offset between the 2 
times and is able to present it.
Measurements on my KBL system were in the order of a few tens of microseconds
(~30us).
I guess we might be able to set up the correlation point better (masking
interrupts?) to reduce the delta.
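For the record type, something like this is what I have in mind (hypothetical
sketch only; name and layout to be discussed):

/*
 * Hypothetical sketch - not a committed uapi.
 *
 * Emitted when the stream is enabled (and again whenever the clocks are
 * re-correlated), reporting the measured offset between sampling the CPU
 * clock and reading the GPU timestamp register:
 *
 *     struct {
 *         struct drm_i915_perf_record_header header;
 *         __u64 cpu_gpu_clock_offset_ns;
 *     };
 */
DRM_I915_PERF_RECORD_CLOCK_SYNC,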

Thanks,

-
Lionel


On 07/12/17 00:57, Robert Bragg wrote:
>
>
> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg <robert@sixbynine.org 
> <mailto:robert@sixbynine.org>> wrote:
>
>
>     at least from what I wrote back then it looks like I was seeing a
>     drift of a few milliseconds per second on SKL. I vaguely recall it
>     being much worse given the frequency constants we had for Haswell.
>
>
> Sorry I didn't actually re-read my own message properly before 
> referencing it :) Apparently the 2ms per second drift was for Haswell, 
> so presumably not quite so bad for SKL.
>
> - Robert
>
>
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx



[-- Attachment #1.2: Type: text/html, Size: 4242 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-06 20:02 ` Lionel Landwerlin
@ 2017-12-22  5:15   ` Sagar Arun Kamble
  2017-12-22  5:26     ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-22  5:15 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 5273 bytes --]



On 12/7/2017 1:32 AM, Lionel Landwerlin wrote:
> I've put together some trival IGT tests : 
> https://github.com/djdeath/intel-gpu-tools/commits/wip/djdeath/cpu-timestamps
> With a few changes which I pointed in the review : 
> https://github.com/djdeath/linux/commit/d0e4cf4d3f464491b4ffe97d112284d1ce73656d
>
> Put together it seems to work relatively well.
> There is still a small drift happening between the 2 timestamps. I've 
> noticed over a 160ms of OA reports, there is a accumulated difference 
> of ~35us between the GPU timestamp and cpu timestamps.
> I may be doing something wrong with the scaling in the tests, or maybe 
> there is an issue in the kernel, or both.
Went through the testcase. The scaled_gpu_delta calculation is the same as what
timecounter does in the kernel to calculate the system time corresponding to a
GPU timestamp, hence we don't see much delta between these two times/time deltas.
Ideally we should be testing the system time delta by sampling the system time
and the GPU timestamp atomically (which isn't feasible unless we do some
precise adjustments).
I have attempted to check these two times/deltas by sampling them in debugfs
and reading over variable periods manually and checking the delta.
https://github.com/sakamble/i915-timestamp-support/commit/03be3056752d7b05a02cd01f5c20b3fcfcf18395

It is showing that the delta is less than 10us in most cases over 30min, and
occasionally around 20-30us. Again, this delta might partly be down to the
initial time setting as well.
Initializing the timecounter with an accurate (system time, GPU timestamp)
pair is important, for which I have currently taken a 350us start offset.

I think the GPU timestamp clock is highly stable, as seen in my testing on
SKL. I think this clock is used for calculations in the GuC too, so it will
be good to get confirmation about the stability of this clock from the GuC
and HW teams.

    systemd-udevd-169   [001] ....     3.035812: i915_driver_load: sys 
start time: 1512308011156099790

    systemd-udevd-169   [001] d...     3.036012: i915_cyclecounter_read: 
52025098974

              cat-1654  [001] ....    52.407957: i915_cyclecounter_read: 
52617562292

              cat-1654  [001] ....    52.407958: i915_timestamp_info: 
sys time: 1512308060527894638

              cat-1654  [001] ....    52.407958: i915_timestamp_info: 
ts: 52617562292 device time: 1512308060528043050

              cat-1684  [001] ....   177.239733: i915_cyclecounter_read: 
54115543581

              cat-1684  [001] ....   177.239736: i915_timestamp_info: 
sys time: 1512308185359666602

              cat-1684  [001] ....   177.239737: i915_timestamp_info: 
ts: 54115543581 device time: 1512308185359817372

              cat-1693  [001] ....   329.820374: i915_cyclecounter_read: 
55946511277

              cat-1693  [001] ....   329.820377: i915_timestamp_info: 
sys time: 1512308337940301732

              cat-1693  [001] ....   329.820378: i915_timestamp_info: 
ts: 55946511277 device time: 1512308337940458996


<delta between system time delta and gpu time delta for above two 
samples (177, 329) = 6494ns>


              cat-1702  [001] ....   506.980313: i915_cyclecounter_read: 
58072430542

              cat-1702  [001] ....   506.980315: i915_timestamp_info: 
sys time: 1512308515100233102

              cat-1702  [001] ....   506.980317: i915_timestamp_info: 
ts: 58072430542 device time: 1512308515100398084

<delta between system time delta and gpu time delta for above two 
samples (329, 506) = 6494ns>

>
> I'll build the GPUTop parts and see if the results make sense.
>
> Thanks!,
>
> -
> Lionel
>
> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>> We can compute system time corresponding to GPU timestamp by taking a
>> reference point (CPU monotonic time, GPU timestamp) and then adding
>> delta time computed using timecounter/cyclecounter support in kernel.
>> We have to configure cyclecounter with the GPU timestamp frequency.
>> Earlier approach that was based on cross-timestamp is not needed. It
>> was being used to approximate the frequency based on invalid assumptions
>> (possibly drift was being seen in the time due to precision issue).
>> The precision of time from GPU clocks is already in ns and timecounter
>> takes care of it as verified over variable durations.
>>
>> This series adds base timecounter/cyclecounter changes and changes to
>> get GPU and CPU timestamps in OA samples.
>>
>> Sagar Arun Kamble (1):
>>    drm/i915/perf: Add support to correlate GPU timestamp with system 
>> time
>>
>> Sourab Gupta (3):
>>    drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>>      reports
>>    drm/i915/perf: Extract raw GPU timestamps from OA reports
>>    drm/i915/perf: Send system clock monotonic time in perf samples
>>
>>   drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>>   drivers/gpu/drm/i915/i915_perf.c | 124 
>> ++++++++++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>>   include/uapi/drm/i915_drm.h      |  14 +++++
>>   4 files changed, 154 insertions(+), 1 deletion(-)
>>
>


[-- Attachment #1.2: Type: text/html, Size: 7586 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-22  5:15   ` Sagar Arun Kamble
@ 2017-12-22  5:26     ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-22  5:26 UTC (permalink / raw)
  To: Lionel Landwerlin, intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 5783 bytes --]



On 12/22/2017 10:45 AM, Sagar Arun Kamble wrote:
>
>
>
> On 12/7/2017 1:32 AM, Lionel Landwerlin wrote:
>> I've put together some trival IGT tests : 
>> https://github.com/djdeath/intel-gpu-tools/commits/wip/djdeath/cpu-timestamps
>> With a few changes which I pointed in the review : 
>> https://github.com/djdeath/linux/commit/d0e4cf4d3f464491b4ffe97d112284d1ce73656d
>>
>> Put together it seems to work relatively well.
>> There is still a small drift happening between the 2 timestamps. I've 
>> noticed over a 160ms of OA reports, there is a accumulated difference 
>> of ~35us between the GPU timestamp and cpu timestamps.
>> I may be doing something wrong with the scaling in the tests, or 
>> maybe there is an issue in the kernel, or both.
> Went through the testcase. scaled_gpu_delta calculation is same as 
> what timecounter does in kernel for calculating system_time corresponding
> to gpu timestamp hence we don't see much of delta between these two 
> times/time deltas.
> Ideally we should be testing the system time delta by sampling system 
> time and gpu timestamp atomically (which isn't feasible unless we do 
> some precise adjustments)
> I have attempted to check these two times/delta by sampling them in 
> debugfs and reading over variable periods manually and checking the delta.
> https://github.com/sakamble/i915-timestamp-support/commit/03be3056752d7b05a02cd01f5c20b3fcfcf18395
>
> It is showing that delta is less than 10us in most cases for 30min and 
> occasionally around 20-30us. Again this delta might be because of 
> initial time setting as well.
> timecounter initializing system time and gpu clocks as pair should be 
> highly accurate for which I have currently taken 350us start offset.
>
> I think gpu timestamp clock is highly stable as seen in my testing on 
> SKL. I think this clock is used for calculations in GuC too so will
> be good to get confirmation about the stability of this clock from 
> them and HW team.
>
>    systemd-udevd-169   [001] ....     3.035812: i915_driver_load: sys 
> start time: 1512308011156099790
>
>    systemd-udevd-169   [001] d...     3.036012: 
> i915_cyclecounter_read: 52025098974
>
>              cat-1654  [001] ....    52.407957: 
> i915_cyclecounter_read: 52617562292
>
>              cat-1654  [001] ....    52.407958: i915_timestamp_info: 
> sys time: 1512308060527894638
>
>              cat-1654  [001] ....    52.407958: i915_timestamp_info: 
> ts: 52617562292 device time: 1512308060528043050
>
>              cat-1684  [001] ....   177.239733: 
> i915_cyclecounter_read: 54115543581
>
>              cat-1684  [001] ....   177.239736: i915_timestamp_info: 
> sys time: 1512308185359666602
>
>              cat-1684  [001] ....   177.239737: i915_timestamp_info: 
> ts: 54115543581 device time: 1512308185359817372
>
>              cat-1693  [001] ....   329.820374: 
> i915_cyclecounter_read: 55946511277
>
>              cat-1693  [001] ....   329.820377: i915_timestamp_info: 
> sys time: 1512308337940301732
>
>              cat-1693  [001] ....   329.820378: i915_timestamp_info: 
> ts: 55946511277 device time: 1512308337940458996
>
>
> <delta between system time delta and gpu time delta for above two 
> samples (177, 329) = 6494ns>
>
>
>              cat-1702  [001] ....   506.980313: 
> i915_cyclecounter_read: 58072430542
>
>              cat-1702  [001] ....   506.980315: i915_timestamp_info: 
> sys time: 1512308515100233102
>
>              cat-1702  [001] ....   506.980317: i915_timestamp_info: 
> ts: 58072430542 device time: 1512308515100398084
>
> <delta between system time delta and gpu time delta for above two 
> samples (329, 506) = 6494ns>
>
Fixing typo here:
<delta between system time delta and gpu time delta for above two 
samples (329, 506) = 7718ns>
>>
>> I'll build the GPUTop parts and see if the results make sense.
>>
>> Thanks!,
>>
>> -
>> Lionel
>>
>> On 15/11/17 12:13, Sagar Arun Kamble wrote:
>>> We can compute system time corresponding to GPU timestamp by taking a
>>> reference point (CPU monotonic time, GPU timestamp) and then adding
>>> delta time computed using timecounter/cyclecounter support in kernel.
>>> We have to configure cyclecounter with the GPU timestamp frequency.
>>> Earlier approach that was based on cross-timestamp is not needed. It
>>> was being used to approximate the frequency based on invalid 
>>> assumptions
>>> (possibly drift was being seen in the time due to precision issue).
>>> The precision of time from GPU clocks is already in ns and timecounter
>>> takes care of it as verified over variable durations.
>>>
>>> This series adds base timecounter/cyclecounter changes and changes to
>>> get GPU and CPU timestamps in OA samples.
>>>
>>> Sagar Arun Kamble (1):
>>>    drm/i915/perf: Add support to correlate GPU timestamp with system 
>>> time
>>>
>>> Sourab Gupta (3):
>>>    drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>>>      reports
>>>    drm/i915/perf: Extract raw GPU timestamps from OA reports
>>>    drm/i915/perf: Send system clock monotonic time in perf samples
>>>
>>>   drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>>>   drivers/gpu/drm/i915/i915_perf.c | 124 
>>> ++++++++++++++++++++++++++++++++++++++-
>>>   drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>>>   include/uapi/drm/i915_drm.h      |  14 +++++
>>>   4 files changed, 154 insertions(+), 1 deletion(-)
>>>
>>
>
>
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[-- Attachment #1.2: Type: text/html, Size: 8774 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-07  0:48 ` Robert Bragg
  2017-12-07  0:57   ` Robert Bragg
@ 2017-12-22  6:06   ` Sagar Arun Kamble
  1 sibling, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-22  6:06 UTC (permalink / raw)
  To: Robert Bragg; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 7440 bytes --]



On 12/7/2017 6:18 AM, Robert Bragg wrote:
>
>
> On Wed, Nov 15, 2017 at 12:13 PM, Sagar Arun Kamble 
> <sagar.a.kamble@intel.com <mailto:sagar.a.kamble@intel.com>> wrote:
>
>     We can compute system time corresponding to GPU timestamp by taking a
>     reference point (CPU monotonic time, GPU timestamp) and then adding
>     delta time computed using timecounter/cyclecounter support in kernel.
>     We have to configure cyclecounter with the GPU timestamp frequency.
>     Earlier approach that was based on cross-timestamp is not needed. It
>     was being used to approximate the frequency based on invalid
>     assumptions
>     (possibly drift was being seen in the time due to precision issue).
>     The precision of time from GPU clocks is already in ns and timecounter
>     takes care of it as verified over variable durations.
>
>
> Hi Sagar,
>
> I have some doubts about this analysis...
>
> The intent behind Sourab's original approach was to be able to 
> determine the frequency at runtime empirically because the constants 
> we have aren't particularly accurate. Without a perfectly stable 
> frequency that's known very precisely then an interpolated correlation 
> will inevitably drift. I think the nature of HW implies we can't 
> expect to have either of those. Then the general idea had been to try 
> and use existing kernel infrastructure for a problem which isn't 
> unique to GPU clocks.
Hi Robert,

Testing on SKL shows the timestamps drift only about 10us for sampling done
in the kernel over about 30 minutes.
Verified with changes from
https://github.com/sakamble/i915-timestamp-support/commits/drm-tip
Note that since we are sampling the counter via debugfs, the read overhead is
likely adding to the apparent drift, so some adjustment might be needed.
But with OA reports we just have to worry about the initial timecounter setup,
where we need an accurate pair of system time and GPU timestamp clock counts.
I think the timestamp clock is highly stable and we don't need logic to
determine the frequency at runtime. Will try to get confirmation from the HW
team as well.

If we need to determine the frequency, Sourab's approach needs to be refined because:
1. It can be implemented entirely in i915, because all we need is a pair of
system time and GPU clock samples taken over different durations.
2. The crosstimestamp framework usage in that approach is incorrect, as
ideally we should be sending the ART counter and the GPU counter. Instead we
were hacking it to send the TSC clock.
Quoting Thomas from https://patchwork.freedesktop.org/patch/144298/
get_device_system_crosststamp() is for timestamps taken via a clock 
which is directly correlated with the timekeeper clocksource.

ART and TSC are correlated via: TSC = (ART * scale) + offset
get_device_system_crosststamp() invokes the device function which reads 
ART, which is converted to CLOCK_MONOTONIC_RAW by the conversion above,
and then uses interpolation to map the CLOCK_MONOTONIC_RAW value to 
CLOCK_MONOTONIC.
The device function does not know anything about TSC. All it knows about 
is ART.

I am not aware whether the GPU timestamp clock is correlated with TSC the way
ART is for the ethernet drivers, or whether i915 can read ART the way those
drivers do.
3. I have seen precision issues in the calculations in
i915_perf_clock_sync_work, and in its use of MONO_RAW, which may cause the
reported time to jump relative to CLOCK_MONOTONIC.
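To make point 2 concrete, here is a minimal sketch of the shape
get_device_system_crosststamp() expects, modelled on what the ethernet drivers
do. The i915 helper here is hypothetical (we have no ART readout today, which
is exactly the problem), so this only illustrates the contract, not a proposal:

#include <linux/ktime.h>
#include <linux/timekeeping.h>
#include <asm/tsc.h>

/*
 * Hypothetical helper: atomically latch ART together with the GPU
 * timestamp (converted to ns). No such readout exists in i915 today.
 */
int i915_read_art_and_gpu_ns(void *ctx, u64 *art, u64 *gpu_ns);

static int i915_get_syncdevicetime(ktime_t *device_time,
				   struct system_counterval_t *system_counter,
				   void *ctx)
{
	u64 art, gpu_ns;
	int ret;

	ret = i915_read_art_and_gpu_ns(ctx, &art, &gpu_ns);
	if (ret)
		return ret;

	/* The timekeeping core can only correlate ART (via TSC), not a GPU clock */
	*system_counter = convert_art_to_tsc(art);
	*device_time = ns_to_ktime(gpu_ns);

	return 0;
}

static int i915_get_crosststamp(void *ctx,
				struct system_device_crosststamp *xtstamp)
{
	return get_device_system_crosststamp(i915_get_syncdevicetime,
					     ctx, NULL, xtstamp);
}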
>
> That's not to say that a more limited, simpler solution based on 
> frequent re-correlation wouldn't be more than welcome if tracking an 
> accurate frequency is too awkward for now
Adjusting the timecounter time can be another option if we confirm that the
GPU timestamp frequency is stable.
> , but I think some things need to be considered in that case:
>
> - It would be good to quantify the kind of drift seen in practice to 
> know how frequently it's necessary to re-synchronize. It sounds like 
> you've done this ("as verified over variable durations") so I'm 
> curious what kind of drift you saw. I'd imagine you would see a 
> significant drift over, say, one second and it might not take much 
> longer for the drift to even become clearly visible to the user when 
> plotted in a UI. For reference I once updated the arb_timer_query test 
> in piglit to give some insight into this drift 
> (https://lists.freedesktop.org/archives/piglit/2016-September/020673.html) 
> and at least from what I wrote back then it looks like I was seeing a 
> drift of a few milliseconds per second on SKL. I vaguely recall it 
> being much worse given the frequency constants we had for Haswell.
>
On SKL I have seen a very small drift of less than 10us over a period of
30 minutes.
Verified with changes from
https://github.com/sakamble/i915-timestamp-support/commits/drm-tip

The 36-bit counter will overflow in about 95 minutes at 12 MHz, and the
timecounter framework treats a counter value whose delta from the timecounter
init point exceeds half the total time covered by the counter as time in the
past, so the current approach works for less than ~45 minutes.
We will need to add overflow watchdog support like other drivers do, which
simply refresh the timecounter well before that ~45 minute limit.
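For reference, a minimal sketch of such a watchdog, modelled on the overflow
work used by the network drivers (the struct and names are hypothetical):
timecounter_read() advances tc->cycle_last and tc->nsec, so scheduling it well
inside the ~45 minute window keeps the accumulated delta small. Real code
would also serialize this against the OA read path:

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/timecounter.h>
#include <linux/workqueue.h>

/* Hypothetical state, e.g. embedded in the perf stream */
struct i915_ts_state {
	struct timecounter tc;
	struct delayed_work overflow_work;
};

/* Refresh every 20 minutes, well inside the ~45 minute limit above */
#define I915_TS_OVERFLOW_PERIOD	(20 * 60 * HZ)

static void i915_ts_overflow_work(struct work_struct *work)
{
	struct i915_ts_state *ts =
		container_of(work, struct i915_ts_state, overflow_work.work);

	/* Advances tc->cycle_last/tc->nsec so the delta never grows stale */
	timecounter_read(&ts->tc);

	schedule_delayed_work(&ts->overflow_work, I915_TS_OVERFLOW_PERIOD);
}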

> - What guarantees will be promised about monotonicity of correlated 
> system timestamps? Will it be guaranteed that sequential reports must 
> have monotonically increasing timestamps? That might be fiddly if the 
> gpu + system clock are periodically re-correlated, so it might be good 
> to be clear in documentation that the correlation is best-effort only 
> for the sake of implementation simplicity. That would still be good 
> for a lot of UIs I think and there's freedom for the driver to start 
> simple and potentially improve later by measuring the gpu clock 
> frequency empirically.
>
If we rely on the timecounter alone, without correlation to determine the
frequency, setting the init time to the MONOTONIC system time should take care
of the monotonicity of the correlated times.
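For what it's worth, a rough sketch of the init path I have in mind
(hypothetical names; the 12 MHz frequency and 36-bit width are the gen9
assumptions used elsewhere in this thread, and the maxsec argument simply
trades conversion range for accuracy):

#include <linux/clocksource.h>
#include <linux/timecounter.h>
#include <linux/timekeeping.h>

#define GEN9_CS_TIMESTAMP_FREQUENCY	12000000	/* assumed 12 MHz */
#define GEN9_CS_TIMESTAMP_WIDTH		36		/* assumed 36 bits */

/* Hypothetical helper: read the raw command streamer timestamp counter */
u64 i915_read_gpu_timestamp(void);

static u64 i915_ts_cc_read(const struct cyclecounter *cc)
{
	return i915_read_gpu_timestamp();
}

static struct cyclecounter i915_ts_cc = {
	.read = i915_ts_cc_read,
	.mask = CLOCKSOURCE_MASK(GEN9_CS_TIMESTAMP_WIDTH),
};

static void i915_ts_init(struct timecounter *tc)
{
	clocks_calc_mult_shift(&i915_ts_cc.mult, &i915_ts_cc.shift,
			       GEN9_CS_TIMESTAMP_FREQUENCY, NSEC_PER_SEC, 0);

	/* Start the correlation from CLOCK_MONOTONIC "now" */
	timecounter_init(tc, &i915_ts_cc, ktime_get_ns());
}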

Regards,
Sagar
> Currently only one correlated pair of timestamps is read when enabling 
> the stream and so a relatively long time is likely to pass before the 
> stream is disabled (seconds, minutes while a user is running a system 
> profiler) . It seems very likely to me that these clocks are going to 
> drift significantly without introducing some form of periodic 
> re-synchronization based on some understanding of the drift that's seen.
>
> Br,
> - Robert
>
>
>
>     This series adds base timecounter/cyclecounter changes and changes to
>     get GPU and CPU timestamps in OA samples.
>
>     Sagar Arun Kamble (1):
>       drm/i915/perf: Add support to correlate GPU timestamp with
>     system time
>
>     Sourab Gupta (3):
>       drm/i915/perf: Add support for collecting 64 bit timestamps with OA
>         reports
>       drm/i915/perf: Extract raw GPU timestamps from OA reports
>       drm/i915/perf: Send system clock monotonic time in perf samples
>
>      drivers/gpu/drm/i915/i915_drv.h  |  11 ++++
>      drivers/gpu/drm/i915/i915_perf.c | 124
>     ++++++++++++++++++++++++++++++++++++++-
>      drivers/gpu/drm/i915/i915_reg.h  |   6 ++
>      include/uapi/drm/i915_drm.h      |  14 +++++
>      4 files changed, 154 insertions(+), 1 deletion(-)
>
>     --
>     1.9.1
>
>     _______________________________________________
>     Intel-gfx mailing list
>     Intel-gfx@lists.freedesktop.org
>     <mailto:Intel-gfx@lists.freedesktop.org>
>     https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>     <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>
>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-21 12:59     ` Lionel Landwerlin
@ 2017-12-22  9:30       ` Sagar Arun Kamble
  2017-12-22 10:16         ` Lionel Landwerlin
  0 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-22  9:30 UTC (permalink / raw)
  To: Lionel Landwerlin, Robert Bragg; +Cc: Intel Graphics Development





On 12/21/2017 6:29 PM, Lionel Landwerlin wrote:
> Some more findings I made while playing with this series & GPUTop.
> Turns out the 2ms drift per second is due to timecounter. Adding the 
> delta this way :
>
> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607
>
> Eliminates the drift.
I see two important changes: 1. approximation of the start time during
init_timecounter, and 2. overflow handling in the delta accumulation.
With these incorporated, I guess timecounter should also work in the same
fashion.
> Timelines of perf i915 tracepoints & OA reports now make a lot more sense.
>
> There is still the issue that reading the CPU clock & the RCS 
> timestamp is inherently not atomic. So there is a delta there.
> I think we should add a new i915 perf record type to express the delta 
> that we measure this way :
>
> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475
>
> So that userspace knows there might be a global offset between the 2 
> times and is able to present it.
Agree on this. The delta ns1-ns0 can be interpreted as the maximum drift.
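For reference, a rough sketch of how I picture capturing that correlation
point (the GPU timestamp read is a hypothetical helper); the ns1 - ns0 window
is what the new record type would report so userspace can treat it as the
maximum error of the pairing:

#include <linux/spinlock.h>
#include <linux/timekeeping.h>

/* Hypothetical helper: read the raw GPU timestamp counter */
u64 i915_read_gpu_timestamp(void);

struct i915_ts_correlation {
	u64 ns0;	/* CLOCK_MONOTONIC just before the GPU read */
	u64 gpu_ts;	/* raw GPU timestamp counter value */
	u64 ns1;	/* CLOCK_MONOTONIC just after the GPU read */
};

static DEFINE_SPINLOCK(i915_ts_lock);

static void i915_sample_correlation(struct i915_ts_correlation *corr)
{
	unsigned long flags;

	/* Keep the sampling window as small as possible */
	spin_lock_irqsave(&i915_ts_lock, flags);
	corr->ns0 = ktime_get_ns();
	corr->gpu_ts = i915_read_gpu_timestamp();
	corr->ns1 = ktime_get_ns();
	spin_unlock_irqrestore(&i915_ts_lock, flags);

	/* corr->ns1 - corr->ns0 bounds the error of the (gpu_ts, ns0) pair */
}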
> Measurement on my KBL system were in the order of a few microseconds 
> (~30us).
> I guess we might be able to setup the correlation point better 
> (masking interruption?) to reduce the delta.
We are already using a spin_lock. Do you mean NMI?
>
> Thanks,
>
> -
> Lionel
>
>
> On 07/12/17 00:57, Robert Bragg wrote:
>>
>>
>> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg <robert@sixbynine.org 
>> <mailto:robert@sixbynine.org>> wrote:
>>
>>
>>     at least from what I wrote back then it looks like I was seeing a
>>     drift of a few milliseconds per second on SKL. I vaguely recall
>>     it being much worse given the frequency constants we had for Haswell.
>>
>>
>> Sorry I didn't actually re-read my own message properly before 
>> referencing it :) Apparently the 2ms per second drift was for 
>> Haswell, so presumably not quite so bad for SKL.
>>
>> - Robert
>>
>>
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-22  9:30       ` Sagar Arun Kamble
@ 2017-12-22 10:16         ` Lionel Landwerlin
  2017-12-26  5:32           ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-22 10:16 UTC (permalink / raw)
  To: Sagar Arun Kamble, Robert Bragg; +Cc: Intel Graphics Development



On 22/12/17 09:30, Sagar Arun Kamble wrote:
>
>
>
> On 12/21/2017 6:29 PM, Lionel Landwerlin wrote:
>> Some more findings I made while playing with this series & GPUTop.
>> Turns out the 2ms drift per second is due to timecounter. Adding the 
>> delta this way :
>>
>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607
>>
>> Eliminates the drift.
> I see two imp. changes 1. approximation of start time during 
> init_timecounter 2. overflow handling in delta accumulation.
> With these incorporated, I guess timecounter should also work in same 
> fashion.

I think the arithmetic in timecounter is inherently lossy and that's why 
we're seeing a drift. Could we be using it wrong?

In the patch above, I think there is still a drift because of the 
potential fractional part loss at every delta we add.
But it should only be a fraction of a nanosecond multiplied by the 
number of reports over a period of time.
With a report every 1us, that should still be much less than 1ms of
drift over 1s.

We can probably do better by always computing the clock using the entire 
delta rather than the accumulated delta.
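Roughly what I mean, as a sketch with hypothetical names (and ignoring for the
moment that OA reports only carry part of the timestamp): keep a single
reference pair and convert the whole elapsed cycle count every time, so the
fractional loss stays bounded at one conversion instead of accumulating per
report:

#include <linux/math64.h>
#include <linux/types.h>

struct i915_ts_ref {
	u64 base_ns;		/* CLOCK_MONOTONIC at the correlation point */
	u64 base_cycles;	/* GPU timestamp at the correlation point */
	u32 mult;		/* ns-per-cycle expressed as a mult/shift pair */
	u32 shift;
};

/* Convert the full delta from the reference every time; any fractional
 * loss is bounded by one conversion rather than accumulating. */
static u64 i915_cycles_to_ns(const struct i915_ts_ref *ref, u64 cycles)
{
	u64 delta = cycles - ref->base_cycles;

	return ref->base_ns + mul_u64_u32_shr(delta, ref->mult, ref->shift);
}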

>> Timelines of perf i915 tracepoints & OA reports now make a lot more 
>> sense.
>>
>> There is still the issue that reading the CPU clock & the RCS 
>> timestamp is inherently not atomic. So there is a delta there.
>> I think we should add a new i915 perf record type to express the 
>> delta that we measure this way :
>>
>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475
>>
>> So that userspace knows there might be a global offset between the 2 
>> times and is able to present it.
> agree on this. Delta ns1-ns0 can be interpreted as max drift.
>> Measurement on my KBL system were in the order of a few microseconds 
>> (~30us).
>> I guess we might be able to setup the correlation point better 
>> (masking interruption?) to reduce the delta.
> already using spin_lock. Do you mean NMI?

I don't actually know much on this point.
If spin_lock is the best we can do, then that's it :)

>>
>> Thanks,
>>
>> -
>> Lionel
>>
>>
>> On 07/12/17 00:57, Robert Bragg wrote:
>>>
>>>
>>> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg <robert@sixbynine.org 
>>> <mailto:robert@sixbynine.org>> wrote:
>>>
>>>
>>>     at least from what I wrote back then it looks like I was seeing
>>>     a drift of a few milliseconds per second on SKL. I vaguely
>>>     recall it being much worse given the frequency constants we had
>>>     for Haswell.
>>>
>>>
>>> Sorry I didn't actually re-read my own message properly before 
>>> referencing it :) Apparently the 2ms per second drift was for 
>>> Haswell, so presumably not quite so bad for SKL.
>>>
>>> - Robert
>>>
>>>
>>>
>>> _______________________________________________
>>> Intel-gfx mailing list
>>> Intel-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>
>>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-22 10:16         ` Lionel Landwerlin
@ 2017-12-26  5:32           ` Sagar Arun Kamble
  2017-12-28 17:13             ` Lionel Landwerlin
  0 siblings, 1 reply; 45+ messages in thread
From: Sagar Arun Kamble @ 2017-12-26  5:32 UTC (permalink / raw)
  To: Lionel Landwerlin, Robert Bragg; +Cc: Intel Graphics Development





On 12/22/2017 3:46 PM, Lionel Landwerlin wrote:
> On 22/12/17 09:30, Sagar Arun Kamble wrote:
>>
>>
>>
>> On 12/21/2017 6:29 PM, Lionel Landwerlin wrote:
>>> Some more findings I made while playing with this series & GPUTop.
>>> Turns out the 2ms drift per second is due to timecounter. Adding the 
>>> delta this way :
>>>
>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607
>>>
>>> Eliminates the drift.
>> I see two imp. changes 1. approximation of start time during 
>> init_timecounter 2. overflow handling in delta accumulation.
>> With these incorporated, I guess timecounter should also work in same 
>> fashion.
>
> I think the arithmetic in timecounter is inherently lossy and that's 
> why we're seeing a drift.
Could you share details about the platform and scenario in which the 2ms drift
per second is being seen with timecounter?
I did not observe this on SKL.
> Could we be using it wrong?
>
If we use the two changes highlighted above with timecounter, maybe we will
get the same results as your current implementation.
> In the patch above, I think there is still a drift because of the 
> potential fractional part loss at every delta we add.
> But it should only be a fraction of a nanosecond multiplied by the 
> number of reports over a period of time.
> With a report every 1us, that should still be much less than a 1ms of 
> drift over 1s.
>
The timecounter interface takes care of the fractional parts, so that should
help us. We can either go with timecounter or with our own implementation,
provided the conversions are precise.
> We can probably do better by always computing the clock using the 
> entire delta rather than the accumulated delta.
>
The issue is that the clock cycles reported in an OA report are the 32 LSBs of
the GPU timestamp, whereas the counter is 36 bits wide. Hence we will need to
accumulate the delta. Of course, this assumes that two consecutive reports
can't be spaced more than 0xffffffff counts apart.
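A small sketch of the accumulation I mean (the state struct is hypothetical);
the unsigned 32-bit subtraction handles the wrap correctly as long as
consecutive reports are less than 0xffffffff counts apart:

#include <linux/types.h>

struct i915_oa_ts_state {
	u32 last_report_ts;	/* 32 LSBs of the GPU TS from the last report */
	u64 total_cycles;	/* accumulated full-width cycle count */
};

/* Wrap-safe for gaps below 0xffffffff cycles between consecutive reports */
static u64 i915_oa_accumulate_ts(struct i915_oa_ts_state *state, u32 report_ts)
{
	u32 delta = report_ts - state->last_report_ts;

	state->last_report_ts = report_ts;
	state->total_cycles += delta;

	return state->total_cycles;
}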
>>> Timelines of perf i915 tracepoints & OA reports now make a lot more 
>>> sense.
>>>
>>> There is still the issue that reading the CPU clock & the RCS 
>>> timestamp is inherently not atomic. So there is a delta there.
>>> I think we should add a new i915 perf record type to express the 
>>> delta that we measure this way :
>>>
>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475
>>>
>>> So that userspace knows there might be a global offset between the 2 
>>> times and is able to present it.
>> agree on this. Delta ns1-ns0 can be interpreted as max drift.
>>> Measurement on my KBL system were in the order of a few microseconds 
>>> (~30us).
>>> I guess we might be able to setup the correlation point better 
>>> (masking interruption?) to reduce the delta.
>> already using spin_lock. Do you mean NMI?
>
> I don't actually know much on this point.
> if spin_lock is the best we can do, then that's it :)
>
>>>
>>> Thanks,
>>>
>>> -
>>> Lionel
>>>
>>>
>>> On 07/12/17 00:57, Robert Bragg wrote:
>>>>
>>>>
>>>> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg <robert@sixbynine.org 
>>>> <mailto:robert@sixbynine.org>> wrote:
>>>>
>>>>
>>>>     at least from what I wrote back then it looks like I was seeing
>>>>     a drift of a few milliseconds per second on SKL. I vaguely
>>>>     recall it being much worse given the frequency constants we had
>>>>     for Haswell.
>>>>
>>>>
>>>> Sorry I didn't actually re-read my own message properly before 
>>>> referencing it :) Apparently the 2ms per second drift was for 
>>>> Haswell, so presumably not quite so bad for SKL.
>>>>
>>>> - Robert
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Intel-gfx mailing list
>>>> Intel-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>>
>>>
>>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-26  5:32           ` Sagar Arun Kamble
@ 2017-12-28 17:13             ` Lionel Landwerlin
  2018-01-03  5:38               ` Sagar Arun Kamble
  0 siblings, 1 reply; 45+ messages in thread
From: Lionel Landwerlin @ 2017-12-28 17:13 UTC (permalink / raw)
  To: Sagar Arun Kamble, Robert Bragg; +Cc: Intel Graphics Development



On 26/12/17 05:32, Sagar Arun Kamble wrote:
>
>
>
> On 12/22/2017 3:46 PM, Lionel Landwerlin wrote:
>> On 22/12/17 09:30, Sagar Arun Kamble wrote:
>>>
>>>
>>>
>>> On 12/21/2017 6:29 PM, Lionel Landwerlin wrote:
>>>> Some more findings I made while playing with this series & GPUTop.
>>>> Turns out the 2ms drift per second is due to timecounter. Adding 
>>>> the delta this way :
>>>>
>>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607
>>>>
>>>> Eliminates the drift.
>>> I see two imp. changes 1. approximation of start time during 
>>> init_timecounter 2. overflow handling in delta accumulation.
>>> With these incorporated, I guess timecounter should also work in 
>>> same fashion.
>>
>> I think the arithmetic in timecounter is inherently lossy and that's 
>> why we're seeing a drift.
> Could you share details about platform, scenario in which 2ms drift 
> per second is being seen with timecounter.
> I did not observe this on SKL.

The 2ms drift was on SKL GT4.

With the patch above, I'm seeing only a ~40us drift over ~7 seconds of
recording both perf tracepoints & i915 perf reports.
I'm tracking the kernel tracepoints adding gem requests and the i915
perf reports.
Here is a screenshot at the beginning of the 7s recording:
https://i.imgur.com/hnexgjQ.png (you can see the gem request added before
the work starts in the i915 perf reports).
At the end of the recording, the gem requests appear later than the work
in the i915 perf report: https://i.imgur.com/oCd0C9T.png

I'll try to prepare some IGT tests that show the drift using perf & i915 
perf, so we can run those on different platforms.
I tend to mostly test on a SKL GT4 & KBL GT2, but BXT definitely needs 
more attention...

>> Could we be using it wrong?
>>
> if we use two changes highlighted above with timecounter maybe we will 
> get same results as your current implementation.
>> In the patch above, I think there is still a drift because of the 
>> potential fractional part loss at every delta we add.
>> But it should only be a fraction of a nanosecond multiplied by the 
>> number of reports over a period of time.
>> With a report every 1us, that should still be much less than a 1ms of 
>> drift over 1s.
>>
> timecounter interface takes care of fractional parts so that should 
> help us.
> we can either go with timecounter or our own implementation provided 
> conversions are precise.

Looking at clocks_calc_mult_shift(), it seems clear to me that there is
less precision when using timecounter:

  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */

On the other hand, there is a performance penalty for doing a div64 for 
every report.
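To make the trade-off concrete, a quick sketch of the two conversion styles
(the 12 MHz frequency is an assumption): the mult/shift form used by
timecounter costs a multiply and a shift per report, with whatever rounding
clocks_calc_mult_shift() picked, while the exact form needs 64-bit division on
every report:

#include <linux/math64.h>
#include <linux/time64.h>

#define TS_FREQ_HZ	12000000ULL	/* assumed 12 MHz timestamp clock */

/* timecounter-style conversion: one multiply + shift per report */
static u64 cycles_to_ns_mult_shift(u64 cycles, u32 mult, u32 shift)
{
	return mul_u64_u32_shr(cycles, mult, shift);
}

/* Exact conversion: 64-bit division per report, split to avoid the
 * intermediate cycles * NSEC_PER_SEC overflow. */
static u64 cycles_to_ns_div64(u64 cycles)
{
	u64 rem;
	u64 secs = div64_u64_rem(cycles, TS_FREQ_HZ, &rem);

	return secs * NSEC_PER_SEC + div64_u64(rem * NSEC_PER_SEC, TS_FREQ_HZ);
}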

>> We can probably do better by always computing the clock using the 
>> entire delta rather than the accumulated delta.
>>
> issue is that the reported clock cycles in the OA report is 32bits LSB 
> of GPU TS whereas counter is 36bits. Hence we will need to
> accumulate the delta. ofc there is assumption that two reports can't 
> be spaced with count value of 0xffffffff apart.

You're right :)
I thought maybe we could do this:

Look at the opening period parameter; if it's longer than the period at which
the timestamps wrap, make sure we schedule some work on the kernel context to
generate a context switch report (at least once every ~6 minutes on gen9,
since the 32-bit timestamp wraps after 2^32 / 12MHz ≈ 358s).

>>>> Timelines of perf i915 tracepoints & OA reports now make a lot more 
>>>> sense.
>>>>
>>>> There is still the issue that reading the CPU clock & the RCS 
>>>> timestamp is inherently not atomic. So there is a delta there.
>>>> I think we should add a new i915 perf record type to express the 
>>>> delta that we measure this way :
>>>>
>>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475
>>>>
>>>> So that userspace knows there might be a global offset between the 
>>>> 2 times and is able to present it.
>>> agree on this. Delta ns1-ns0 can be interpreted as max drift.
>>>> Measurement on my KBL system were in the order of a few 
>>>> microseconds (~30us).
>>>> I guess we might be able to setup the correlation point better 
>>>> (masking interruption?) to reduce the delta.
>>> already using spin_lock. Do you mean NMI?
>>
>> I don't actually know much on this point.
>> if spin_lock is the best we can do, then that's it :)
>>
>>>>
>>>> Thanks,
>>>>
>>>> -
>>>> Lionel
>>>>
>>>>
>>>> On 07/12/17 00:57, Robert Bragg wrote:
>>>>>
>>>>>
>>>>> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg 
>>>>> <robert@sixbynine.org <mailto:robert@sixbynine.org>> wrote:
>>>>>
>>>>>
>>>>>     at least from what I wrote back then it looks like I was
>>>>>     seeing a drift of a few milliseconds per second on SKL. I
>>>>>     vaguely recall it being much worse given the frequency
>>>>>     constants we had for Haswell.
>>>>>
>>>>>
>>>>> Sorry I didn't actually re-read my own message properly before 
>>>>> referencing it :) Apparently the 2ms per second drift was for 
>>>>> Haswell, so presumably not quite so bad for SKL.
>>>>>
>>>>> - Robert
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Intel-gfx mailing list
>>>>> Intel-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>>>
>>>>
>>>
>>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events
  2017-12-28 17:13             ` Lionel Landwerlin
@ 2018-01-03  5:38               ` Sagar Arun Kamble
  0 siblings, 0 replies; 45+ messages in thread
From: Sagar Arun Kamble @ 2018-01-03  5:38 UTC (permalink / raw)
  To: Lionel Landwerlin, Robert Bragg; +Cc: Intel Graphics Development





On 12/28/2017 10:43 PM, Lionel Landwerlin wrote:
> On 26/12/17 05:32, Sagar Arun Kamble wrote:
>>
>>
>>
>> On 12/22/2017 3:46 PM, Lionel Landwerlin wrote:
>>> On 22/12/17 09:30, Sagar Arun Kamble wrote:
>>>>
>>>>
>>>>
>>>> On 12/21/2017 6:29 PM, Lionel Landwerlin wrote:
>>>>> Some more findings I made while playing with this series & GPUTop.
>>>>> Turns out the 2ms drift per second is due to timecounter. Adding 
>>>>> the delta this way :
>>>>>
>>>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607
>>>>>
>>>>> Eliminates the drift.
>>>> I see two imp. changes 1. approximation of start time during 
>>>> init_timecounter 2. overflow handling in delta accumulation.
>>>> With these incorporated, I guess timecounter should also work in 
>>>> same fashion.
>>>
>>> I think the arithmetic in timecounter is inherently lossy and that's 
>>> why we're seeing a drift.
>> Could you share details about platform, scenario in which 2ms drift 
>> per second is being seen with timecounter.
>> I did not observe this on SKL.
>
> The 2ms drift was on SKL GT4.
>
I have checked the timecounter arithmetic. The accuracy is very high (on the
order of micro-nanoseconds per tick).
I had interpreted the maxsec parameter of clocks_calc_mult_shift(), used to
calculate mult/shift, as the total time covered by the counter, but it
actually controls the conversion accuracy. Since we want the best possible
accuracy, passing zero should be preferred there.
For instance, below are the mult/shift values, and the time reported for 10
minutes with those values, for SKL GT2 at 12 MHz.
As you can see, the drift due to the calculation alone is only about 2us at
worst. We should re-check on SKL GT4 with zero passed to
clocks_calc_mult_shift() and with the new delta handling added to timecounter.
2ms is a huge drift and is very unlikely to be related to these calculations.

maxsec  mult        shift  tick time (mult/2^shift)  total time (10*60*12000000 * tick time)  drift due to calculation
0       2796202667  25     83.33333334326 ns         600,000,000,071.525 ns                   71 ns
3000    174762667   21     83.33333349227 ns         600,000,001,144.409 ns                   1144 ns
6000    87381333    20     83.33333301544 ns         599,999,997,711.181 ns                   2289 ns
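The pairs above can be reproduced with something like this (the 12 MHz
frequency and the maxsec values are the ones from the table; the function name
is made up for the example):

#include <linux/clocksource.h>
#include <linux/kernel.h>
#include <linux/time64.h>

/* Print the mult/shift pairs clocks_calc_mult_shift() picks for a 12 MHz
 * clock; maxsec = 0 lets it choose the most accurate pair. */
static void i915_ts_dump_mult_shift(void)
{
	static const u32 maxsec[] = { 0, 3000, 6000 };
	u32 mult, shift;
	int i;

	for (i = 0; i < ARRAY_SIZE(maxsec); i++) {
		clocks_calc_mult_shift(&mult, &shift, 12000000, NSEC_PER_SEC,
				       maxsec[i]);
		pr_info("maxsec=%u -> mult=%u shift=%u\n",
			maxsec[i], mult, shift);
	}
}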
> With the patch above, I'm seeing only a ~40us drift over ~7seconds of 
> recording both perf tracepoints & i915 perf reports.
> I'm tracking the kernel tracepoints adding gem requests and the i915 
> perf reports.
> Here a screenshot at the beginning of the 7s recording : 
> https://i.imgur.com/hnexgjQ.png (you can see the gem request add 
> before the work starts in the i915 perf reports).
> At the end of the recording, the gem requests appear later than the 
> work in the i915 perf report : https://i.imgur.com/oCd0C9T.png
>
Looks like we only need an error margin of a few microseconds :)
> I'll try to prepare some IGT tests that show the drift using perf & 
> i915 perf, so we can run those on different platforms.
> I tend to mostly test on a SKL GT4 & KBL GT2, but BXT definitely needs 
> more attention...
>
>>> Could we be using it wrong?
>>>
>> if we use two changes highlighted above with timecounter maybe we 
>> will get same results as your current implementation.
>>> In the patch above, I think there is still a drift because of the 
>>> potential fractional part loss at every delta we add.
>>> But it should only be a fraction of a nanosecond multiplied by the 
>>> number of reports over a period of time.
>>> With a report every 1us, that should still be much less than a 1ms 
>>> of drift over 1s.
>>>
>> timecounter interface takes care of fractional parts so that should 
>> help us.
>> we can either go with timecounter or our own implementation provided 
>> conversions are precise.
>
> Looking at clocks_calc_mult_shift(), it seems clear to me that there 
> is less precision when using timecounter :
>
>  /*
>   * Find the conversion shift/mult pair which has the best
>   * accuracy and fits the maxsec conversion range:
>   */
>
We can improve upon this by passing zero as maxsec to 
clocks_calc_mult_shift.
> On the other hand, there is a performance penalty for doing a div64 
> for every report.
>
>>> We can probably do better by always computing the clock using the 
>>> entire delta rather than the accumulated delta.
>>>
>> issue is that the reported clock cycles in the OA report is 32bits 
>> LSB of GPU TS whereas counter is 36bits. Hence we will need to
>> accumulate the delta. ofc there is assumption that two reports can't 
>> be spaced with count value of 0xffffffff apart.
>
> You're right :)
> I thought maybe we could do this :
>
> Look at the opening period parameter; if it's longer than the period
> of timestamps wrapping, make sure we schedule some work on the kernel
> context to generate a context switch report (like at least once every
> 6 minutes on gen9).
>
Looks fine to me.
>>>>> Timelines of perf i915 tracepoints & OA reports now make a lot 
>>>>> more sense.
>>>>>
>>>>> There is still the issue that reading the CPU clock & the RCS 
>>>>> timestamp is inherently not atomic. So there is a delta there.
>>>>> I think we should add a new i915 perf record type to express the 
>>>>> delta that we measure this way :
>>>>>
>>>>> https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475
>>>>>
>>>>> So that userspace knows there might be a global offset between the 
>>>>> 2 times and is able to present it.
>>>> agree on this. Delta ns1-ns0 can be interpreted as max drift.
>>>>> Measurement on my KBL system were in the order of a few 
>>>>> microseconds (~30us).
>>>>> I guess we might be able to setup the correlation point better 
>>>>> (masking interruption?) to reduce the delta.
>>>> already using spin_lock. Do you mean NMI?
>>>
>>> I don't actually know much on this point.
>>> if spin_lock is the best we can do, then that's it :)
>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -
>>>>> Lionel
>>>>>
>>>>>
>>>>> On 07/12/17 00:57, Robert Bragg wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 7, 2017 at 12:48 AM, Robert Bragg 
>>>>>> <robert@sixbynine.org <mailto:robert@sixbynine.org>> wrote:
>>>>>>
>>>>>>
>>>>>>     at least from what I wrote back then it looks like I was
>>>>>>     seeing a drift of a few milliseconds per second on SKL. I
>>>>>>     vaguely recall it being much worse given the frequency
>>>>>>     constants we had for Haswell.
>>>>>>
>>>>>>
>>>>>> Sorry I didn't actually re-read my own message properly before 
>>>>>> referencing it :) Apparently the 2ms per second drift was for 
>>>>>> Haswell, so presumably not quite so bad for SKL.
>>>>>>
>>>>>> - Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Intel-gfx mailing list
>>>>>> Intel-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>>>>>
>>>>>
>>>>
>>>
>>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2018-01-03  5:38 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-15 12:13 [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events Sagar Arun Kamble
2017-11-15 12:13 ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Sagar Arun Kamble
2017-11-15 12:25   ` Chris Wilson
2017-11-15 16:41     ` Sagar Arun Kamble
2017-11-23  7:34     ` Creating cyclecounter and lock member in timecounter structure [ Was Re: [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time] Sagar Arun Kamble
2017-11-23 18:59       ` Thomas Gleixner
2017-11-24  9:06         ` Sagar Arun Kamble
2017-11-24 13:31           ` Thomas Gleixner
2017-11-27 10:05             ` Sagar Arun Kamble
2017-11-27 10:05               ` Sagar Arun Kamble
     [not found]               ` <63a2a495-1bdb-5d47-1202-9b538e9601d8-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-11-30 21:03                 ` Saeed Mahameed
2017-11-30 21:03                   ` Saeed Mahameed
     [not found]                   ` <CALzJLG9JXOnr3EQ2zLcmwKx8S9-CGONRRBSAd9XwHdemEgOn2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-01  7:42                     ` Sagar Arun Kamble
2017-12-01  7:42                       ` Sagar Arun Kamble
2017-12-05 13:58     ` [RFC 1/4] drm/i915/perf: Add support to correlate GPU timestamp with system time Lionel Landwerlin
2017-12-06  8:17       ` Sagar Arun Kamble
2017-11-15 12:13 ` [RFC 2/4] drm/i915/perf: Add support for collecting 64 bit timestamps with OA reports Sagar Arun Kamble
2017-12-06 16:01   ` Lionel Landwerlin
2017-12-21  8:38     ` Sagar Arun Kamble
2017-11-15 12:13 ` [RFC 3/4] drm/i915/perf: Extract raw GPU timestamps from " Sagar Arun Kamble
2017-12-06 19:55   ` Lionel Landwerlin
2017-12-21  8:50     ` Sagar Arun Kamble
2017-11-15 12:13 ` [RFC 4/4] drm/i915/perf: Send system clock monotonic time in perf samples Sagar Arun Kamble
2017-11-15 12:31   ` Chris Wilson
2017-11-15 16:51     ` Sagar Arun Kamble
2017-11-15 17:54   ` Sagar Arun Kamble
2017-12-05 14:22   ` Lionel Landwerlin
2017-12-06  8:31     ` Sagar Arun Kamble
2017-11-15 12:30 ` ✗ Fi.CI.BAT: warning for GPU/CPU timestamps correlation for relating OA samples with system events Patchwork
2017-12-05 14:16 ` [RFC 0/4] " Lionel Landwerlin
2017-12-05 14:28   ` Robert Bragg
2017-12-05 14:37     ` Lionel Landwerlin
2017-12-06  9:01       ` Sagar Arun Kamble
2017-12-06 20:02 ` Lionel Landwerlin
2017-12-22  5:15   ` Sagar Arun Kamble
2017-12-22  5:26     ` Sagar Arun Kamble
2017-12-07  0:48 ` Robert Bragg
2017-12-07  0:57   ` Robert Bragg
2017-12-21 12:59     ` Lionel Landwerlin
2017-12-22  9:30       ` Sagar Arun Kamble
2017-12-22 10:16         ` Lionel Landwerlin
2017-12-26  5:32           ` Sagar Arun Kamble
2017-12-28 17:13             ` Lionel Landwerlin
2018-01-03  5:38               ` Sagar Arun Kamble
2017-12-22  6:06   ` Sagar Arun Kamble

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.