* [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-05 17:47 ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-05 17:47 UTC (permalink / raw)
  To: intel-gfx, dri-devel
  Cc: john.c.harrison, Tvrtko Ursulin, daniel.vetter, Matthew Brost

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time of sampling the engine busyness (now), if the id is valid
(!= ~0) and start is non-zero, then the context is considered active
and the engine busyness is calculated using the equation below:

	engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.
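
A minimal sketch of that sampling rule (the struct and helper names here
are illustrative only, not the patch's actual types):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the sampling rule described above; field and
 * function names are illustrative, not the patch's real structures.
 */
struct engine_sample {
	uint64_t total;	/* accumulated busy time */
	uint64_t start;	/* timestamp of the last context switch-in */
	uint32_t id;	/* running context id, ~0 when invalid */
};

static uint64_t engine_busyness(const struct engine_sample *s, uint64_t now)
{
	/* Active only if the context id is valid and start is non-zero. */
	if (s->id != ~0u && s->start)
		return s->total + (now - s->start);

	return s->total;
}
```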

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since the perf pmu exposes busyness as a 64 bit
monotonically increasing value, this implementation must account for
overflows and extend the time to 64 bits before returning busyness to
the user. To do that, a worker runs periodically with a period equal to
1/8th of the time it takes for the timestamp to wrap. As an example,
that is once every 27 seconds for a gt clock frequency of 19.2 MHz.
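
The worker period follows from the wrap time of a 32-bit counter; a
small sketch of the arithmetic (illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the ping period: poll at 1/8th of the time it takes a
 * 32-bit gt timestamp to wrap at the given clock frequency.
 */
static uint64_t ping_delay_secs(uint64_t gt_clock_hz)
{
	uint64_t poll_time_clks = UINT32_MAX / 8; /* 1/8th of wrap time */

	return poll_time_clks / gt_clock_hz;
}
```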

Open issues and WIP items that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.

Note:
There might be over-accounting of busyness because GuC may be updating
the total and start values while the KMD is reading them (i.e. the KMD
may read the updated total and the stale start). In such a case, the
user may see a higher busyness value followed by smaller ones, which
would eventually catch up to the higher value.
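
The race can be illustrated with made-up numbers: if GuC switches a
context out (updating total) and a new one in (updating start) between
the KMD's two reads, one sample pairs the fresh total with the stale
start and is transiently inflated:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration of the transient over-accounting with made-up numbers;
 * this is not the patch's code, just the arithmetic of a torn read.
 */
static uint64_t busyness(uint64_t total, uint64_t start, uint64_t now)
{
	return total + (now - start);
}
```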

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h               |   2 +
 12 files changed, 398 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-					    ktime_t *now)
-{
-	ktime_t total = engine->stats.total;
-
-	/*
-	 * If the engine is executing something at the moment
-	 * add it to the total.
-	 */
-	*now = ktime_get();
-	if (READ_ONCE(engine->stats.active))
-		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-	return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 {
-	unsigned int seq;
-	ktime_t total;
-
-	do {
-		seq = read_seqcount_begin(&engine->stats.lock);
-		total = __intel_engine_get_busy_time(engine, now);
-	} while (read_seqcount_retry(&engine->stats.lock, seq));
-
-	return total;
+	return engine->busyness(engine, now);
 }
 
 struct intel_context *
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 5ae1207c363b..8e1b9c38a6fc 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -432,6 +432,12 @@ struct intel_engine_cs {
 	void		(*add_active_request)(struct i915_request *rq);
 	void		(*remove_active_request)(struct i915_request *rq);
 
+	/*
+	 * Get engine busyness and the time at which the busyness was sampled.
+	 */
+	ktime_t		(*busyness)(struct intel_engine_cs *engine,
+				    ktime_t *now);
+
 	struct intel_engine_execlists execlists;
 
 	/*
@@ -481,30 +487,66 @@ struct intel_engine_cs {
 	u32 (*get_cmd_length_mask)(u32 cmd_header);
 
 	struct {
-		/**
-		 * @active: Number of contexts currently scheduled in.
-		 */
-		unsigned int active;
-
-		/**
-		 * @lock: Lock protecting the below fields.
-		 */
-		seqcount_t lock;
-
-		/**
-		 * @total: Total time this engine was busy.
-		 *
-		 * Accumulated time not counting the most recent block in cases
-		 * where engine is currently busy (active > 0).
-		 */
-		ktime_t total;
-
-		/**
-		 * @start: Timestamp of the last idle to active transition.
-		 *
-		 * Idle is defined as active == 0, active is active > 0.
-		 */
-		ktime_t start;
+		union {
+			/* Fields used by the execlists backend. */
+			struct {
+				/**
+				 * @active: Number of contexts currently
+				 * scheduled in.
+				 */
+				unsigned int active;
+
+				/**
+				 * @lock: Lock protecting the below fields.
+				 */
+				seqcount_t lock;
+
+				/**
+				 * @total: Total time this engine was busy.
+				 *
+				 * Accumulated time not counting the most recent
+				 * block in cases where engine is currently busy
+				 * (active > 0).
+				 */
+				ktime_t total;
+
+				/**
+				 * @start: Timestamp of the last idle to active
+				 * transition.
+				 *
+				 * Idle is defined as active == 0, active is
+				 * active > 0.
+				 */
+				ktime_t start;
+			};
+
+			/* Fields used by the GuC backend. */
+			struct {
+				/**
+				 * @running: Active state of the engine when
+				 * busyness was last sampled.
+				 */
+				bool running;
+
+				/**
+				 * @prev_total: Previous value of total runtime
+				 * clock cycles.
+				 */
+				u32 prev_total;
+
+				/**
+				 * @total_gt_clks: Total gt clock cycles this
+				 * engine was busy.
+				 */
+				u64 total_gt_clks;
+
+				/**
+				 * @start_gt_clk: GT clock time of last idle to
+				 * active transition.
+				 */
+				u64 start_gt_clk;
+			};
+		};
 
 		/**
 		 * @rps: Utilisation at last RPS sampling.
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 7147fe80919e..5c9b695e906c 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
 	lrc_fini_wa_ctx(engine);
 }
 
+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
+					   ktime_t *now)
+{
+	ktime_t total = engine->stats.total;
+
+	/*
+	 * If the engine is executing something at the moment
+	 * add it to the total.
+	 */
+	*now = ktime_get();
+	if (READ_ONCE(engine->stats.active))
+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+
+	return total;
+}
+
+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
+					 ktime_t *now)
+{
+	unsigned int seq;
+	ktime_t total;
+
+	do {
+		seq = read_seqcount_begin(&engine->stats.lock);
+		total = __execlists_engine_busyness(engine, now);
+	} while (read_seqcount_retry(&engine->stats.lock, seq));
+
+	return total;
+}
+
 static void
 logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 {
@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_bb_start = gen8_emit_bb_start;
 	else
 		engine->emit_bb_start = gen8_emit_bb_start_noarb;
+
+	engine->busyness = execlists_engine_busyness;
 }
 
 static void logical_ring_default_irqs(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 524eaf678790..b4a8594bc46c 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
 	intel_rc6_unpark(&gt->rc6);
 	intel_rps_unpark(&gt->rps);
 	i915_pmu_gt_unparked(i915);
+	intel_guc_busyness_unpark(gt);
 
 	intel_gt_unpark_requests(gt);
 	runtime_begin(gt);
@@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
 	runtime_end(gt);
 	intel_gt_park_requests(gt);
 
+	intel_guc_busyness_park(gt);
 	i915_vma_parked(gt);
 	i915_pmu_gt_parked(i915);
 	intel_rps_park(&gt->rps);
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 8ff582222aff..ff1311d4beff 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -143,6 +143,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
 	INTEL_GUC_ACTION_LIMIT
 };
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 5dd174babf7a..22c30dbdf63a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -104,6 +104,8 @@ struct intel_guc {
 	u32 ads_regset_size;
 	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
 	u32 ads_golden_ctxt_size;
+	/** @ads_engine_usage_size: size of engine usage in the ADS */
+	u32 ads_engine_usage_size;
 
 	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
 	struct i915_vma *lrc_desc_pool;
@@ -138,6 +140,30 @@ struct intel_guc {
 
 	/** @send_mutex: used to serialize the intel_guc_send actions */
 	struct mutex send_mutex;
+
+	struct {
+		/**
+		 * @lock: Lock protecting the below fields and the engine stats.
+		 */
+		spinlock_t lock;
+
+		/**
+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
+		 */
+		u64 gt_stamp;
+
+		/**
+		 * @ping_delay: Period for polling the GT timestamp for
+		 * overflow.
+		 */
+		unsigned long ping_delay;
+
+		/**
+		 * @work: Periodic work to adjust GT timestamp, engine and
+		 * context usage for overflows.
+		 */
+		struct delayed_work work;
+	} timestamp;
 };
 
 static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 2c6ea64af7ec..ca9ab53999d5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -26,6 +26,8 @@
  *      | guc_policies                          |
  *      +---------------------------------------+
  *      | guc_gt_system_info                    |
+ *      +---------------------------------------+
+ *      | guc_engine_usage                      |
  *      +---------------------------------------+ <== static
  *      | guc_mmio_reg[countA] (engine 0.0)     |
  *      | guc_mmio_reg[countB] (engine 0.1)     |
@@ -47,6 +49,7 @@ struct __guc_ads_blob {
 	struct guc_ads ads;
 	struct guc_policies policies;
 	struct guc_gt_system_info system_info;
+	struct guc_engine_usage engine_usage;
 	/* From here on, location is dynamic! Refer to above diagram. */
 	struct guc_mmio_reg regset[0];
 } __packed;
@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
 
 	guc_ads_private_data_reset(guc);
 }
+
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
+{
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
+	u32 offset = base + ptr_offset(blob, engine_usage);
+
+	return offset;
+}
+
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
+{
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u8 guc_class = engine_class_to_guc_class(engine->class);
+
+	return &blob->engine_usage.engines[guc_class][engine->instance];
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
index 3d85051d57e4..e74c110facff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
@@ -6,8 +6,11 @@
 #ifndef _INTEL_GUC_ADS_H_
 #define _INTEL_GUC_ADS_H_
 
+#include <linux/types.h>
+
 struct intel_guc;
 struct drm_printer;
+struct intel_engine_cs;
 
 int intel_guc_ads_create(struct intel_guc *guc);
 void intel_guc_ads_destroy(struct intel_guc *guc);
@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
 void intel_guc_ads_reset(struct intel_guc *guc);
 void intel_guc_ads_print_policy_info(struct intel_guc *guc,
 				     struct drm_printer *p);
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
 
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index fa4be13c8854..7c9c081670fc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -294,6 +294,19 @@ struct guc_ads {
 	u32 reserved[15];
 } __packed;
 
+/* Engine usage stats */
+struct guc_engine_usage_record {
+	u32 current_context_index;
+	u32 last_switch_in_stamp;
+	u32 reserved0;
+	u32 total_runtime;
+	u32 reserved1[4];
+} __packed;
+
+struct guc_engine_usage {
+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
+} __packed;
+
 /* GuC logging structures */
 
 enum guc_log_buffer_type {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba0de35f6323..3f7d0f2ac9da 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -12,6 +12,7 @@
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gt.h"
+#include "gt/intel_gt_clock_utils.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
 #include "gt/intel_gt_requests.h"
@@ -20,6 +21,7 @@
 #include "gt/intel_mocs.h"
 #include "gt/intel_ring.h"
 
+#include "intel_guc_ads.h"
 #include "intel_guc_submission.h"
 
 #include "i915_drv.h"
@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
 static void disable_submission(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
 
 	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 		GEM_BUG_ON(!guc->ct.enabled);
 		__tasklet_disable_sync_once(&sched_engine->tasklet);
 		sched_engine->tasklet.callback = NULL;
 	}
+
+	cancel_delayed_work(&guc->timestamp.work);
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	for_each_engine(engine, gt, id)
+		engine->stats.prev_total = 0;
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
 static void enable_submission(struct intel_guc *guc)
@@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+/*
+ * GuC stores busyness stats for each engine at context in/out boundaries. A
+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
+ * GuC.
+ *
+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
+ * active. For an active engine total busyness = total + (now - start), where
+ * 'now' is the time at which the busyness is sampled. For inactive engine,
+ * total busyness = total.
+ *
+ * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
+ *
+ * The start and total values provided by GuC are 32 bits and wrap around in a
+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
+ * increasing ns values, there is a need for this implementation to account for
+ * overflows and extend the GuC provided values to 64 bits before returning
+ * busyness to the user. In order to do that, a worker runs periodically at
+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
+ * 27 seconds for a gt clock frequency of 19.2 MHz).
+ */
+
+#define WRAP_TIME_CLKS U32_MAX
+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
+
+static void
+__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
+{
+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
+
+	if (new_start == lower_32_bits(*prev_start))
+		return;
+
+	if (new_start < gt_stamp_last &&
+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
+		gt_stamp_hi++;
+
+	if (new_start > gt_stamp_last &&
+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
+		gt_stamp_hi--;
+
+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
+}
+
+static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
+{
+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	u32 last_switch = rec->last_switch_in_stamp;
+	u32 ctx_id = rec->current_context_index;
+	u32 total = rec->total_runtime;
+
+	lockdep_assert_held(&guc->timestamp.lock);
+
+	engine->stats.running = ctx_id != ~0U && last_switch;
+	if (engine->stats.running)
+		__extend_last_switch(guc, &engine->stats.start_gt_clk,
+				     last_switch);
+
+	/*
+	 * Instead of adjusting the total for overflow, just add the
+	 * difference from previous sample to the stats.total_gt_clks
+	 */
+	if (total && total != ~0U) {
+		engine->stats.total_gt_clks += (u32)(total -
+						     engine->stats.prev_total);
+		engine->stats.prev_total = total;
+	}
+}
+
+static void guc_update_pm_timestamp(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	u32 gt_stamp_now, gt_stamp_hi;
+
+	lockdep_assert_held(&guc->timestamp.lock);
+
+	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
+	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+
+	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
+		gt_stamp_hi++;
+
+	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
+}
+
+/*
+ * Unlike the execlist mode of submission total and active times are in terms of
+ * gt clocks. The *now parameter is retained to return the cpu time at which the
+ * busyness was sampled.
+ */
+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
+{
+	struct intel_gt *gt = engine->gt;
+	struct intel_guc *guc = &gt->uc.guc;
+	unsigned long flags;
+	u64 total;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	*now = ktime_get();
+
+	/*
+	 * The active busyness depends on start_gt_clk and gt_stamp.
+	 * gt_stamp is updated by i915 only when gt is awake and the
+	 * start_gt_clk is derived from GuC state. To get a consistent
+	 * view of activity, we query the GuC state only if gt is awake.
+	 */
+	if (intel_gt_pm_get_if_awake(gt)) {
+		guc_update_engine_gt_clks(engine);
+		guc_update_pm_timestamp(guc);
+		intel_gt_pm_put_async(gt);
+	}
+
+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
+	if (engine->stats.running) {
+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
+
+		total += intel_gt_clock_interval_to_ns(gt, clk);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+
+	return ns_to_ktime(total);
+}
+
+static void __update_guc_busyness_stats(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	if (intel_gt_pm_get_if_awake(gt)) {
+		guc_update_pm_timestamp(guc);
+
+		for_each_engine(engine, gt, id)
+			guc_update_engine_gt_clks(engine);
+
+		intel_gt_pm_put_async(gt);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+}
+
+static void guc_timestamp_ping(struct work_struct *wrk)
+{
+	struct intel_guc *guc = container_of(wrk, typeof(*guc),
+					     timestamp.work.work);
+
+	__update_guc_busyness_stats(guc);
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+}
+
+static int guc_action_enable_usage_stats(struct intel_guc *guc)
+{
+	u32 offset = intel_guc_engine_usage_offset(guc);
+	u32 action[] = {
+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
+		offset,
+		0,
+	};
+
+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
+}
+
+static void guc_init_engine_stats(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_wakeref_t wakeref;
+
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+
+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
+		int ret = guc_action_enable_usage_stats(guc);
+
+		if (ret)
+			drm_err(&gt->i915->drm,
+				"Failed to enable usage stats: %d!\n", ret);
+	}
+}
+
+void intel_guc_busyness_park(struct intel_gt *gt)
+{
+	struct intel_guc *guc = &gt->uc.guc;
+
+	cancel_delayed_work(&guc->timestamp.work);
+	__update_guc_busyness_stats(guc);
+}
+
+void intel_guc_busyness_unpark(struct intel_gt *gt)
+{
+	struct intel_guc *guc = &gt->uc.guc;
+
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+}
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
  */
 int intel_guc_submission_init(struct intel_guc *guc)
 {
+	struct intel_gt *gt = guc_to_gt(guc);
 	int ret;
 
 	if (guc->lrc_desc_pool)
@@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->guc_id_list);
 	ida_init(&guc->guc_ids);
 
+	spin_lock_init(&guc->timestamp.lock);
+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
+
 	return 0;
 }
 
@@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_flush = gen12_emit_flush_xcs;
 	}
 	engine->set_default_submission = guc_set_default_submission;
+	engine->busyness = guc_engine_busyness;
 
+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
 	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
 	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
 
@@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 void intel_guc_submission_enable(struct intel_guc *guc)
 {
 	guc_init_lrc_mapping(guc);
+	guc_init_engine_stats(guc);
 }
 
 void intel_guc_submission_disable(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c7ef44fa0c36..5a95a9f0a8e3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 				    struct i915_request *hung_rq,
 				    struct drm_printer *m);
+void intel_guc_busyness_park(struct intel_gt *gt);
+void intel_guc_busyness_unpark(struct intel_gt *gt);
 
 bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
 
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index a897f4abea0c..9aee08425382 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
 #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
 
+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
+
 /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
 #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
 #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
-- 
2.20.1


 	lrc_fini_wa_ctx(engine);
 }
 
+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
+					   ktime_t *now)
+{
+	ktime_t total = engine->stats.total;
+
+	/*
+	 * If the engine is executing something at the moment
+	 * add it to the total.
+	 */
+	*now = ktime_get();
+	if (READ_ONCE(engine->stats.active))
+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+
+	return total;
+}
+
+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
+					 ktime_t *now)
+{
+	unsigned int seq;
+	ktime_t total;
+
+	do {
+		seq = read_seqcount_begin(&engine->stats.lock);
+		total = __execlists_engine_busyness(engine, now);
+	} while (read_seqcount_retry(&engine->stats.lock, seq));
+
+	return total;
+}
+
 static void
 logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 {
@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_bb_start = gen8_emit_bb_start;
 	else
 		engine->emit_bb_start = gen8_emit_bb_start_noarb;
+
+	engine->busyness = execlists_engine_busyness;
 }
 
 static void logical_ring_default_irqs(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 524eaf678790..b4a8594bc46c 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
 	intel_rc6_unpark(&gt->rc6);
 	intel_rps_unpark(&gt->rps);
 	i915_pmu_gt_unparked(i915);
+	intel_guc_busyness_unpark(gt);
 
 	intel_gt_unpark_requests(gt);
 	runtime_begin(gt);
@@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
 	runtime_end(gt);
 	intel_gt_park_requests(gt);
 
+	intel_guc_busyness_park(gt);
 	i915_vma_parked(gt);
 	i915_pmu_gt_parked(i915);
 	intel_rps_park(&gt->rps);
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 8ff582222aff..ff1311d4beff 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -143,6 +143,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
 	INTEL_GUC_ACTION_LIMIT
 };
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 5dd174babf7a..22c30dbdf63a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -104,6 +104,8 @@ struct intel_guc {
 	u32 ads_regset_size;
 	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
 	u32 ads_golden_ctxt_size;
+	/** @ads_engine_usage_size: size of engine usage in the ADS */
+	u32 ads_engine_usage_size;
 
 	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
 	struct i915_vma *lrc_desc_pool;
@@ -138,6 +140,30 @@ struct intel_guc {
 
 	/** @send_mutex: used to serialize the intel_guc_send actions */
 	struct mutex send_mutex;
+
+	struct {
+		/**
+		 * @lock: Lock protecting the below fields and the engine stats.
+		 */
+		spinlock_t lock;
+
+		/**
+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
+		 */
+		u64 gt_stamp;
+
+		/**
+		 * @ping_delay: Period for polling the GT timestamp for
+		 * overflow.
+		 */
+		unsigned long ping_delay;
+
+		/**
+		 * @work: Periodic work to adjust GT timestamp, engine and
+		 * context usage for overflows.
+		 */
+		struct delayed_work work;
+	} timestamp;
 };
 
 static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 2c6ea64af7ec..ca9ab53999d5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -26,6 +26,8 @@
  *      | guc_policies                          |
  *      +---------------------------------------+
  *      | guc_gt_system_info                    |
+ *      +---------------------------------------+
+ *      | guc_engine_usage                      |
  *      +---------------------------------------+ <== static
  *      | guc_mmio_reg[countA] (engine 0.0)     |
  *      | guc_mmio_reg[countB] (engine 0.1)     |
@@ -47,6 +49,7 @@ struct __guc_ads_blob {
 	struct guc_ads ads;
 	struct guc_policies policies;
 	struct guc_gt_system_info system_info;
+	struct guc_engine_usage engine_usage;
 	/* From here on, location is dynamic! Refer to above diagram. */
 	struct guc_mmio_reg regset[0];
 } __packed;
@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
 
 	guc_ads_private_data_reset(guc);
 }
+
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
+{
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
+	u32 offset = base + ptr_offset(blob, engine_usage);
+
+	return offset;
+}
+
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
+{
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u8 guc_class = engine_class_to_guc_class(engine->class);
+
+	return &blob->engine_usage.engines[guc_class][engine->instance];
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
index 3d85051d57e4..e74c110facff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
@@ -6,8 +6,11 @@
 #ifndef _INTEL_GUC_ADS_H_
 #define _INTEL_GUC_ADS_H_
 
+#include <linux/types.h>
+
 struct intel_guc;
 struct drm_printer;
+struct intel_engine_cs;
 
 int intel_guc_ads_create(struct intel_guc *guc);
 void intel_guc_ads_destroy(struct intel_guc *guc);
@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
 void intel_guc_ads_reset(struct intel_guc *guc);
 void intel_guc_ads_print_policy_info(struct intel_guc *guc,
 				     struct drm_printer *p);
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
 
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index fa4be13c8854..7c9c081670fc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -294,6 +294,19 @@ struct guc_ads {
 	u32 reserved[15];
 } __packed;
 
+/* Engine usage stats */
+struct guc_engine_usage_record {
+	u32 current_context_index;
+	u32 last_switch_in_stamp;
+	u32 reserved0;
+	u32 total_runtime;
+	u32 reserved1[4];
+} __packed;
+
+struct guc_engine_usage {
+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
+} __packed;
+
 /* GuC logging structures */
 
 enum guc_log_buffer_type {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba0de35f6323..3f7d0f2ac9da 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -12,6 +12,7 @@
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gt.h"
+#include "gt/intel_gt_clock_utils.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
 #include "gt/intel_gt_requests.h"
@@ -20,6 +21,7 @@
 #include "gt/intel_mocs.h"
 #include "gt/intel_ring.h"
 
+#include "intel_guc_ads.h"
 #include "intel_guc_submission.h"
 
 #include "i915_drv.h"
@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
 static void disable_submission(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
 
 	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 		GEM_BUG_ON(!guc->ct.enabled);
 		__tasklet_disable_sync_once(&sched_engine->tasklet);
 		sched_engine->tasklet.callback = NULL;
 	}
+
+	cancel_delayed_work(&guc->timestamp.work);
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	for_each_engine(engine, gt, id)
+		engine->stats.prev_total = 0;
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
 static void enable_submission(struct intel_guc *guc)
@@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
 	intel_gt_unpark_heartbeats(guc_to_gt(guc));
 }
 
+/*
+ * GuC stores busyness stats for each engine at context in/out boundaries. A
+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
+ * GuC.
+ *
+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
+ * active. For an active engine, total busyness = total + (now - start),
+ * where 'now' is the time at which the busyness is sampled. For an
+ * inactive engine, total busyness = total.
+ *
+ * All times are captured from the GUCPMTIMESTAMP register and are in the
+ * gt clock domain.
+ *
+ * The start and total values provided by GuC are 32 bits and wrap around in a
+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
+ * increasing ns values, there is a need for this implementation to account for
+ * overflows and extend the GuC provided values to 64 bits before returning
+ * busyness to the user. In order to do that, a worker runs periodically at
+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
+ * 27 seconds for a gt clock frequency of 19.2 MHz).
+ */
+
+#define WRAP_TIME_CLKS U32_MAX
+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
+
+static void
+__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
+{
+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
+
+	if (new_start == lower_32_bits(*prev_start))
+		return;
+
+	if (new_start < gt_stamp_last &&
+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
+		gt_stamp_hi++;
+
+	if (new_start > gt_stamp_last &&
+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
+		gt_stamp_hi--;
+
+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
+}
+
+static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
+{
+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	u32 last_switch = rec->last_switch_in_stamp;
+	u32 ctx_id = rec->current_context_index;
+	u32 total = rec->total_runtime;
+
+	lockdep_assert_held(&guc->timestamp.lock);
+
+	engine->stats.running = ctx_id != ~0U && last_switch;
+	if (engine->stats.running)
+		__extend_last_switch(guc, &engine->stats.start_gt_clk,
+				     last_switch);
+
+	/*
+	 * Instead of adjusting the total for overflow, just add the
+	 * difference from the previous sample to stats.total_gt_clks.
+	 */
+	if (total && total != ~0U) {
+		engine->stats.total_gt_clks += (u32)(total -
+						     engine->stats.prev_total);
+		engine->stats.prev_total = total;
+	}
+}
+
+static void guc_update_pm_timestamp(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	u32 gt_stamp_now, gt_stamp_hi;
+
+	lockdep_assert_held(&guc->timestamp.lock);
+
+	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
+	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+
+	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
+		gt_stamp_hi++;
+
+	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
+}
+
+/*
+ * Unlike the execlists mode of submission, total and active times are in
+ * terms of gt clocks. The *now parameter is retained to return the cpu time
+ * at which the busyness was sampled.
+ */
+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
+{
+	struct intel_gt *gt = engine->gt;
+	struct intel_guc *guc = &gt->uc.guc;
+	unsigned long flags;
+	u64 total;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	*now = ktime_get();
+
+	/*
+	 * The active busyness depends on start_gt_clk and gt_stamp.
+	 * gt_stamp is updated by i915 only when gt is awake and the
+	 * start_gt_clk is derived from GuC state. To get a consistent
+	 * view of activity, we query the GuC state only if gt is awake.
+	 */
+	if (intel_gt_pm_get_if_awake(gt)) {
+		guc_update_engine_gt_clks(engine);
+		guc_update_pm_timestamp(guc);
+		intel_gt_pm_put_async(gt);
+	}
+
+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
+	if (engine->stats.running) {
+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
+
+		total += intel_gt_clock_interval_to_ns(gt, clk);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+
+	return ns_to_ktime(total);
+}
+
+static void __update_guc_busyness_stats(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	if (intel_gt_pm_get_if_awake(gt)) {
+		guc_update_pm_timestamp(guc);
+
+		for_each_engine(engine, gt, id)
+			guc_update_engine_gt_clks(engine);
+
+		intel_gt_pm_put_async(gt);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+}
+
+static void guc_timestamp_ping(struct work_struct *wrk)
+{
+	struct intel_guc *guc = container_of(wrk, typeof(*guc),
+					     timestamp.work.work);
+
+	__update_guc_busyness_stats(guc);
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+}
+
+static int guc_action_enable_usage_stats(struct intel_guc *guc)
+{
+	u32 offset = intel_guc_engine_usage_offset(guc);
+	u32 action[] = {
+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
+		offset,
+		0,
+	};
+
+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
+}
+
+static void guc_init_engine_stats(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_wakeref_t wakeref;
+
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+
+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
+		int ret = guc_action_enable_usage_stats(guc);
+
+		if (ret)
+			drm_err(&gt->i915->drm,
+				"Failed to enable usage stats: %d!\n", ret);
+	}
+}
+
+void intel_guc_busyness_park(struct intel_gt *gt)
+{
+	struct intel_guc *guc = &gt->uc.guc;
+
+	cancel_delayed_work(&guc->timestamp.work);
+	__update_guc_busyness_stats(guc);
+}
+
+void intel_guc_busyness_unpark(struct intel_gt *gt)
+{
+	struct intel_guc *guc = &gt->uc.guc;
+
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
+			 guc->timestamp.ping_delay);
+}
+
 /*
  * Set up the memory resources to be shared with the GuC (via the GGTT)
  * at firmware loading time.
  */
 int intel_guc_submission_init(struct intel_guc *guc)
 {
+	struct intel_gt *gt = guc_to_gt(guc);
 	int ret;
 
 	if (guc->lrc_desc_pool)
@@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
 	INIT_LIST_HEAD(&guc->guc_id_list);
 	ida_init(&guc->guc_ids);
 
+	spin_lock_init(&guc->timestamp.lock);
+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
+
 	return 0;
 }
 
@@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_flush = gen12_emit_flush_xcs;
 	}
 	engine->set_default_submission = guc_set_default_submission;
+	engine->busyness = guc_engine_busyness;
 
+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
 	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
 	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
 
@@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 void intel_guc_submission_enable(struct intel_guc *guc)
 {
 	guc_init_lrc_mapping(guc);
+	guc_init_engine_stats(guc);
 }
 
 void intel_guc_submission_disable(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c7ef44fa0c36..5a95a9f0a8e3 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
 void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
 				    struct i915_request *hung_rq,
 				    struct drm_printer *m);
+void intel_guc_busyness_park(struct intel_gt *gt);
+void intel_guc_busyness_unpark(struct intel_gt *gt);
 
 bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
 
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index a897f4abea0c..9aee08425382 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
 #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
 
+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
+
 /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
 #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
 #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
  2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
  (?)
@ 2021-10-05 22:14 ` Patchwork
  -1 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2021-10-05 22:14 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
URL   : https://patchwork.freedesktop.org/series/95043/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
6044dafc24be drm/i915/pmu: Connect engine busyness stats from GuC to pmu
-:577: CHECK:SPACING: No space is necessary after a cast
#577: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1229:
+	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;

total: 0 errors, 0 warnings, 1 checks, 614 lines checked




* [Intel-gfx] ✗ Fi.CI.DOCS: warning for drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
  2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
  (?)
  (?)
@ 2021-10-05 22:20 ` Patchwork
  -1 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2021-10-05 22:20 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
URL   : https://patchwork.freedesktop.org/series/95043/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
./drivers/gpu/drm/i915/gt/uc/intel_guc.h:167: warning: Function parameter or member 'timestamp' not described in 'intel_guc'




* [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
  2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
                   ` (2 preceding siblings ...)
  (?)
@ 2021-10-05 22:49 ` Patchwork
  -1 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2021-10-05 22:49 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 2919 bytes --]

== Series Details ==

Series: drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2)
URL   : https://patchwork.freedesktop.org/series/95043/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10685 -> Patchwork_21254
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21254 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21254, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21254/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21254:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@gt_engines:
    - fi-rkl-guc:         [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10685/fi-rkl-guc/igt@i915_selftest@live@gt_engines.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21254/fi-rkl-guc/igt@i915_selftest@live@gt_engines.html

  
Known issues
------------

  Here are the changes found in Patchwork_21254 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@runner@aborted:
    - fi-rkl-guc:         NOTRUN -> [FAIL][3] ([i915#3928])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21254/fi-rkl-guc/igt@runner@aborted.html

  
  [i915#3928]: https://gitlab.freedesktop.org/drm/intel/issues/3928


Participating hosts (41 -> 1)
------------------------------

  ERROR: It appears as if the changes made in Patchwork_21254 prevented too many machines from booting.

  Missing    (40): fi-kbl-soraka fi-rkl-11600 bat-dg1-6 fi-bdw-gvtdvm fi-icl-u2 fi-apl-guc fi-snb-2520m fi-pnv-d510 fi-icl-y fi-skl-6600u fi-snb-2600 fi-cml-u2 fi-bxt-dsi fi-bdw-5557u fi-bsw-n3050 fi-tgl-u2 fi-glk-dsi fi-bwr-2160 fi-kbl-7500u fi-ctg-p8600 fi-hsw-4770 fi-ivb-3770 fi-elk-e7500 fi-bsw-nick fi-kbl-r fi-kbl-7567u fi-ilk-m540 fi-tgl-dsi fi-cfl-8700k fi-ehl-2 bat-jsl-1 fi-jsl-1 fi-hsw-4200u fi-tgl-1115g4 fi-bsw-cyan fi-cfl-guc bat-adlp-4 fi-cfl-8109u fi-kbl-8809g fi-bsw-kefka 


Build changes
-------------

  * Linux: CI_DRM_10685 -> Patchwork_21254

  CI-20190529: 20190529
  CI_DRM_10685: 36c3656c997b07f326d6b967efb1b75e01713773 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6232: effad6af5678be711a2c3e58e182319de784de54 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21254: 6044dafc24be45f573885716d0b7ac1b78133e3f @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

6044dafc24be drm/i915/pmu: Connect engine busyness stats from GuC to pmu

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21254/index.html



* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
@ 2021-10-05 23:14   ` Matthew Brost
  -1 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2021-10-05 23:14 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-gfx, dri-devel, john.c.harrison, Tvrtko Ursulin, daniel.vetter

On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
> With GuC handling scheduling, i915 is not aware of the time that a
> context is scheduled in and out of the engine. Since i915 pmu relies on
> this info to provide engine busyness to the user, GuC shares this info
> with i915 for all engines using shared memory. For each engine, this
> info contains:
> 
> - total busyness: total time that the context was running (total)
> - id: id of the running context (id)
> - start timestamp: timestamp when the context started running (start)
> 
> At the time (now) of sampling the engine busyness, if the id is valid
> (!= ~0), and start is non-zero, then the context is considered to be
> active and the engine busyness is calculated using the below equation
> 
> 	engine busyness = total + (now - start)
> 
> All times are obtained from the gt clock base. For inactive contexts,
> engine busyness is just equal to the total.
> 
> The start and total values provided by GuC are 32 bits and wrap around
> in a few minutes. Since perf pmu provides busyness as 64 bit
> monotonically increasing values, there is a need for this implementation
> to account for overflows and extend the time to 64 bits before returning
> busyness to the user. In order to do that, a worker runs periodically at
> frequency = 1/8th the time it takes for the timestamp to wrap. As an
> example, that would be once in 27 seconds for a gt clock frequency of
> 19.2 MHz.
> 
> Opens and wip that are targeted for later patches:
> 
> 1) On global gt reset the total busyness of engines resets and i915
>    needs to fix that so that user sees monotonically increasing
>    busyness.
> 2) In runtime suspend mode, the worker may not need to be run. We could
>    stop the worker on suspend and rerun it on resume provided that the
>    guc pm timestamp does not tick during suspend.
> 
> Note:
> There might be an overaccounting of busyness due to the fact that GuC
> may be updating the total and start values while kmd is reading them.
> (i.e kmd may read the updated total and the stale start). In such a
> case, user may see higher busyness value followed by smaller ones which
> would eventually catch up to the higher value.
> 
> v2: (Tvrtko)
> - Include details in commit message
> - Move intel engine busyness function into execlist code
> - Use union inside engine->stats
> - Use natural type for ping delay jiffies
> - Drop active_work condition checks
> - Use for_each_engine if iterating all engines
> - Drop seq locking, use spinlock at guc level to update engine stats
> - Document worker specific details
> 
> v3: (Tvrtko/Umesh)
> - Demarcate guc and execlist stat objects with comments
> - Document known over-accounting issue in commit
> - Provide a consistent view of guc state
> - Add hooks to gt park/unpark for guc busyness
> - Stop/start worker in gt park/unpark path
> - Drop inline
> - Move spinlock and worker inits to guc initialization
> - Drop helpers that are called only once
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>  12 files changed, 398 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..6fcc70a313d9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>  	intel_engine_print_breadcrumbs(engine, m);
>  }
>  
> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> -					    ktime_t *now)
> -{
> -	ktime_t total = engine->stats.total;
> -
> -	/*
> -	 * If the engine is executing something at the moment
> -	 * add it to the total.
> -	 */
> -	*now = ktime_get();
> -	if (READ_ONCE(engine->stats.active))
> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> -
> -	return total;
> -}
> -
>  /**
>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>   * @engine: engine to report on
> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>   */
>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>  {
> -	unsigned int seq;
> -	ktime_t total;
> -
> -	do {
> -		seq = read_seqcount_begin(&engine->stats.lock);
> -		total = __intel_engine_get_busy_time(engine, now);
> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> -
> -	return total;
> +	return engine->busyness(engine, now);
>  }
>  
>  struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..8e1b9c38a6fc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>  	void		(*add_active_request)(struct i915_request *rq);
>  	void		(*remove_active_request)(struct i915_request *rq);
>  
> +	/*
> +	 * Get engine busyness and the time at which the busyness was sampled.
> +	 */
> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> +				    ktime_t *now);
> +
>  	struct intel_engine_execlists execlists;
>  
>  	/*
> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>  
>  	struct {
> -		/**
> -		 * @active: Number of contexts currently scheduled in.
> -		 */
> -		unsigned int active;
> -
> -		/**
> -		 * @lock: Lock protecting the below fields.
> -		 */
> -		seqcount_t lock;
> -
> -		/**
> -		 * @total: Total time this engine was busy.
> -		 *
> -		 * Accumulated time not counting the most recent block in cases
> -		 * where engine is currently busy (active > 0).
> -		 */
> -		ktime_t total;
> -
> -		/**
> -		 * @start: Timestamp of the last idle to active transition.
> -		 *
> -		 * Idle is defined as active == 0, active is active > 0.
> -		 */
> -		ktime_t start;
> +		union {
> +			/* Fields used by the execlists backend. */
> +			struct {
> +				/**
> +				 * @active: Number of contexts currently
> +				 * scheduled in.
> +				 */
> +				unsigned int active;
> +
> +				/**
> +				 * @lock: Lock protecting the below fields.
> +				 */
> +				seqcount_t lock;
> +
> +				/**
> +				 * @total: Total time this engine was busy.
> +				 *
> +				 * Accumulated time not counting the most recent
> +				 * block in cases where engine is currently busy
> +				 * (active > 0).
> +				 */
> +				ktime_t total;
> +
> +				/**
> +				 * @start: Timestamp of the last idle to active
> +				 * transition.
> +				 *
> +				 * Idle is defined as active == 0, active is
> +				 * active > 0.
> +				 */
> +				ktime_t start;
> +			};

Not anonymous? e.g.

struct {
	...
} execlists;
struct {
	...
} guc;

IMO this is better as it is self-documenting, and if you touch a
backend-specific field in a non-backend-specific file it pops out as
incorrect.

> +
> +			/* Fields used by the GuC backend. */
> +			struct {
> +				/**
> +				 * @running: Active state of the engine when
> +				 * busyness was last sampled.
> +				 */
> +				bool running;
> +
> +				/**
> +				 * @prev_total: Previous value of total runtime
> +				 * clock cycles.
> +				 */
> +				u32 prev_total;
> +
> +				/**
> +				 * @total_gt_clks: Total gt clock cycles this
> +				 * engine was busy.
> +				 */
> +				u64 total_gt_clks;
> +
> +				/**
> +				 * @start_gt_clk: GT clock time of last idle to
> +				 * active transition.
> +				 */
> +				u64 start_gt_clk;
> +			};
> +		};
>  
>  		/**
>  		 * @rps: Utilisation at last RPS sampling.
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5c9b695e906c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>  	lrc_fini_wa_ctx(engine);
>  }
>  
> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> +					   ktime_t *now)
> +{
> +	ktime_t total = engine->stats.total;
> +
> +	/*
> +	 * If the engine is executing something at the moment
> +	 * add it to the total.
> +	 */
> +	*now = ktime_get();
> +	if (READ_ONCE(engine->stats.active))
> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> +
> +	return total;
> +}
> +
> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> +					 ktime_t *now)
> +{
> +	unsigned int seq;
> +	ktime_t total;
> +
> +	do {
> +		seq = read_seqcount_begin(&engine->stats.lock);
> +		total = __execlists_engine_busyness(engine, now);
> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> +
> +	return total;
> +}
> +
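As an aside for readers following along: the retry loop above is the
standard seqcount read pattern. A toy, single-threaded model of the idea
(hypothetical names, not the kernel's seqcount_t, which additionally needs
memory barriers and lockdep annotations):

```c
#include <assert.h>

/* Toy seqcount: even seq = stable, odd = write in progress. */
struct toy_stats {
	unsigned int seq;
	long long total;
};

static unsigned int toy_read_begin(const struct toy_stats *s)
{
	return s->seq;
}

static int toy_read_retry(const struct toy_stats *s, unsigned int seq)
{
	/* Retry if a write was in flight at begin, or happened since. */
	return (seq & 1) || s->seq != seq;
}

static void toy_write(struct toy_stats *s, long long total)
{
	s->seq++;		/* odd: readers must retry */
	s->total = total;
	s->seq++;		/* even: stable again */
}
```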
>  static void
>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>  {
> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>  		engine->emit_bb_start = gen8_emit_bb_start;
>  	else
>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> +
> +	engine->busyness = execlists_engine_busyness;
>  }
>  
>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 524eaf678790..b4a8594bc46c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>  	intel_rc6_unpark(&gt->rc6);
>  	intel_rps_unpark(&gt->rps);
>  	i915_pmu_gt_unparked(i915);
> +	intel_guc_busyness_unpark(gt);

I personally don't mind this, but in the spirit of correct layering this
should likely be a generic inline wrapper which calls a vfunc if present
(e.g. set the vfunc for the GuC backend, leave it unset for execlists).
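For illustration, the kind of layering meant here might look like the
standalone sketch below (hypothetical names, userspace C, not actual i915
code): the generic GT layer calls a backend hook only when one is set, so
backend-specific calls never leak into the gt_pm paths.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of an optional-vfunc wrapper; names are hypothetical. */
struct gt_sketch {
	int park_calls;	/* side effect recorded by the GuC hook */
	/* NULL for backends (e.g. execlists) that need no park work. */
	void (*busyness_park)(struct gt_sketch *gt);
};

static void guc_busyness_park_hook(struct gt_sketch *gt)
{
	gt->park_calls++;
}

/* Generic layer: stays backend-agnostic, only dispatches if set. */
static void gt_busyness_park(struct gt_sketch *gt)
{
	if (gt->busyness_park)
		gt->busyness_park(gt);
}
```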

>  
>  	intel_gt_unpark_requests(gt);
>  	runtime_begin(gt);
> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>  	runtime_end(gt);
>  	intel_gt_park_requests(gt);
>  
> +	intel_guc_busyness_park(gt);

Same here.

>  	i915_vma_parked(gt);
>  	i915_pmu_gt_parked(i915);
>  	intel_rps_park(&gt->rps);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ff1311d4beff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -143,6 +143,7 @@ enum intel_guc_action {
>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>  	INTEL_GUC_ACTION_LIMIT
>  };
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..22c30dbdf63a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -104,6 +104,8 @@ struct intel_guc {
>  	u32 ads_regset_size;
>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>  	u32 ads_golden_ctxt_size;
> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> +	u32 ads_engine_usage_size;
>  
>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>  	struct i915_vma *lrc_desc_pool;
> @@ -138,6 +140,30 @@ struct intel_guc {
>  
>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>  	struct mutex send_mutex;
> +
> +	struct {
> +		/**
> +		 * @lock: Lock protecting the below fields and the engine stats.
> +		 */
> +		spinlock_t lock;
> +

Again, I really don't mind, but I'm told not to add more spin locks than
needed. This really should be protected by a generic GuC submission spin
lock, e.g. build on this patch and protect all of this with
submission_state.lock.

https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5

Whichever series gets merged first can include the above patch.

The rest of the series looks fine cosmetically to me.

Matt

> +		/**
> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> +		 */
> +		u64 gt_stamp;
> +
> +		/**
> +		 * @ping_delay: Period for polling the GT timestamp for
> +		 * overflow.
> +		 */
> +		unsigned long ping_delay;
> +
> +		/**
> +		 * @work: Periodic work to adjust GT timestamp, engine and
> +		 * context usage for overflows.
> +		 */
> +		struct delayed_work work;
> +	} timestamp;
>  };
>  
>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..ca9ab53999d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -26,6 +26,8 @@
>   *      | guc_policies                          |
>   *      +---------------------------------------+
>   *      | guc_gt_system_info                    |
> + *      +---------------------------------------+
> + *      | guc_engine_usage                      |
>   *      +---------------------------------------+ <== static
>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>   *      | guc_mmio_reg[countB] (engine 0.1)     |
> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>  	struct guc_ads ads;
>  	struct guc_policies policies;
>  	struct guc_gt_system_info system_info;
> +	struct guc_engine_usage engine_usage;
>  	/* From here on, location is dynamic! Refer to above diagram. */
>  	struct guc_mmio_reg regset[0];
>  } __packed;
> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>  
>  	guc_ads_private_data_reset(guc);
>  }
> +
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> +{
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> +	u32 offset = base + ptr_offset(blob, engine_usage);
> +
> +	return offset;
> +}
> +
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> +{
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u8 guc_class = engine_class_to_guc_class(engine->class);
> +
> +	return &blob->engine_usage.engines[guc_class][engine->instance];
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> index 3d85051d57e4..e74c110facff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> @@ -6,8 +6,11 @@
>  #ifndef _INTEL_GUC_ADS_H_
>  #define _INTEL_GUC_ADS_H_
>  
> +#include <linux/types.h>
> +
>  struct intel_guc;
>  struct drm_printer;
> +struct intel_engine_cs;
>  
>  int intel_guc_ads_create(struct intel_guc *guc);
>  void intel_guc_ads_destroy(struct intel_guc *guc);
> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>  void intel_guc_ads_reset(struct intel_guc *guc);
>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>  				     struct drm_printer *p);
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>  
>  #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..7c9c081670fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -294,6 +294,19 @@ struct guc_ads {
>  	u32 reserved[15];
>  } __packed;
>  
> +/* Engine usage stats */
> +struct guc_engine_usage_record {
> +	u32 current_context_index;
> +	u32 last_switch_in_stamp;
> +	u32 reserved0;
> +	u32 total_runtime;
> +	u32 reserved1[4];
> +} __packed;
> +
> +struct guc_engine_usage {
> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> +} __packed;
> +
>  /* GuC logging structures */
>  
>  enum guc_log_buffer_type {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..3f7d0f2ac9da 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -12,6 +12,7 @@
>  #include "gt/intel_engine_pm.h"
>  #include "gt/intel_engine_heartbeat.h"
>  #include "gt/intel_gt.h"
> +#include "gt/intel_gt_clock_utils.h"
>  #include "gt/intel_gt_irq.h"
>  #include "gt/intel_gt_pm.h"
>  #include "gt/intel_gt_requests.h"
> @@ -20,6 +21,7 @@
>  #include "gt/intel_mocs.h"
>  #include "gt/intel_ring.h"
>  
> +#include "intel_guc_ads.h"
>  #include "intel_guc_submission.h"
>  
>  #include "i915_drv.h"
> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>  static void disable_submission(struct intel_guc *guc)
>  {
>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
>  
>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>  		GEM_BUG_ON(!guc->ct.enabled);
>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>  		sched_engine->tasklet.callback = NULL;
>  	}
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	for_each_engine(engine, gt, id)
> +		engine->stats.prev_total = 0;
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>  }
>  
>  static void enable_submission(struct intel_guc *guc)
> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>  }
>  
> +/*
> + * GuC stores busyness stats for each engine at context in/out boundaries. A
> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> + * GuC.
> + *
> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> + * active. For an active engine total busyness = total + (now - start), where
> + * 'now' is the time at which the busyness is sampled. For inactive engine,
> + * total busyness = total.
> + *
> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> + *
> + * The start and total values provided by GuC are 32 bits and wrap around in a
> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> + * increasing ns values, there is a need for this implementation to account for
> + * overflows and extend the GuC provided values to 64 bits before returning
> + * busyness to the user. In order to do that, a worker runs periodically at
> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> + * 27 seconds for a gt clock frequency of 19.2 MHz).
> + */
> +
> +#define WRAP_TIME_CLKS U32_MAX
> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> +
> +static void
> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> +{
> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> +
> +	if (new_start == lower_32_bits(*prev_start))
> +		return;
> +
> +	if (new_start < gt_stamp_last &&
> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> +		gt_stamp_hi++;
> +
> +	if (new_start > gt_stamp_last &&
> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> +		gt_stamp_hi--;
> +
> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> +}
> +
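One thing that may not be obvious from the hunk above: both comparisons
rely on unsigned 32-bit subtraction wrapping, so the subtraction yields a
small value exactly when the 32-bit timestamp wrapped between the two
samples. A standalone model of that logic (hypothetical names, userspace C,
not the driver code itself):

```c
#include <assert.h>
#include <stdint.h>

#define POLL_TIME_CLKS (UINT32_MAX >> 3)

/* Extend a wrapping 32-bit start timestamp to 64 bits, using the current
 * 64-bit extended GT stamp as a reference point. */
static uint64_t extend_last_switch(uint64_t gt_stamp, uint64_t prev_start,
				   uint32_t new_start)
{
	uint32_t hi = (uint32_t)(gt_stamp >> 32);
	uint32_t last = (uint32_t)gt_stamp;

	if (new_start == (uint32_t)prev_start)
		return prev_start;

	/* new_start already wrapped past zero while gt_stamp has not:
	 * only then is the wrapped difference small. */
	if (new_start < last && (uint32_t)(new_start - last) <= POLL_TIME_CLKS)
		hi++;

	/* gt_stamp wrapped, but new_start was sampled before the wrap. */
	if (new_start > last && (uint32_t)(last - new_start) <= POLL_TIME_CLKS && hi)
		hi--;

	return ((uint64_t)hi << 32) | new_start;
}
```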
> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> +{
> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	u32 last_switch = rec->last_switch_in_stamp;
> +	u32 ctx_id = rec->current_context_index;
> +	u32 total = rec->total_runtime;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	engine->stats.running = ctx_id != ~0U && last_switch;
> +	if (engine->stats.running)
> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> +				     last_switch);
> +
> +	/*
> +	 * Instead of adjusting the total for overflow, just add the
> +	 * difference from previous sample to the stats.total_gt_clks
> +	 */
> +	if (total && total != ~0U) {
> +		engine->stats.total_gt_clks += (u32)(total -
> +						     engine->stats.prev_total);
> +		engine->stats.prev_total = total;
> +	}
> +}
> +
> +static void guc_update_pm_timestamp(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u32 gt_stamp_now, gt_stamp_hi;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> +
> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> +		gt_stamp_hi++;
> +
> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
> +}
> +
> +/*
> + * Unlike the execlist mode of submission total and active times are in terms of
> + * gt clocks. The *now parameter is retained to return the cpu time at which the
> + * busyness was sampled.
> + */
> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> +{
> +	struct intel_gt *gt = engine->gt;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	unsigned long flags;
> +	u64 total;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	*now = ktime_get();
> +
> +	/*
> +	 * The active busyness depends on start_gt_clk and gt_stamp.
> +	 * gt_stamp is updated by i915 only when gt is awake and the
> +	 * start_gt_clk is derived from GuC state. To get a consistent
> +	 * view of activity, we query the GuC state only if gt is awake.
> +	 */
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_engine_gt_clks(engine);
> +		guc_update_pm_timestamp(guc);
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> +	if (engine->stats.running) {
> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> +
> +		total += intel_gt_clock_interval_to_ns(gt, clk);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	return ns_to_ktime(total);
> +}
> +
> +static void __update_guc_busyness_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_pm_timestamp(guc);
> +
> +		for_each_engine(engine, gt, id)
> +			guc_update_engine_gt_clks(engine);
> +
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +}
> +
> +static void guc_timestamp_ping(struct work_struct *wrk)
> +{
> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> +					     timestamp.work.work);
> +
> +	__update_guc_busyness_stats(guc);
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> +{
> +	u32 offset = intel_guc_engine_usage_offset(guc);
> +	u32 action[] = {
> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> +		offset,
> +		0,
> +	};
> +
> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> +}
> +
> +static void guc_init_engine_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_wakeref_t wakeref;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +
> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> +		int ret = guc_action_enable_usage_stats(guc);
> +
> +		if (ret)
> +			drm_err(&gt->i915->drm,
> +				"Failed to enable usage stats: %d!\n", ret);
> +	}
> +}
> +
> +void intel_guc_busyness_park(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +	__update_guc_busyness_stats(guc);
> +}
> +
> +void intel_guc_busyness_unpark(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
>  /*
>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>   * at firmware loading time.
>   */
>  int intel_guc_submission_init(struct intel_guc *guc)
>  {
> +	struct intel_gt *gt = guc_to_gt(guc);
>  	int ret;
>  
>  	if (guc->lrc_desc_pool)
> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	INIT_LIST_HEAD(&guc->guc_id_list);
>  	ida_init(&guc->guc_ids);
>  
> +	spin_lock_init(&guc->timestamp.lock);
> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> +
>  	return 0;
>  }
>  
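To sanity-check the number quoted in the commit message (a ping roughly
every 27-28 seconds at 19.2 MHz), the ping_delay computation in isolation
(a model with hypothetical names, not the driver code; the driver
multiplies this by HZ to get jiffies):

```c
#include <assert.h>
#include <stdint.h>

#define WRAP_TIME_CLKS UINT32_MAX
#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)

/* Worker period in whole seconds: 1/8th of the 32-bit wrap time at the
 * given gt clock frequency, rounded up by the trailing +1. */
static uint32_t ping_delay_seconds(uint32_t gt_clock_frequency)
{
	return POLL_TIME_CLKS / gt_clock_frequency + 1;
}
```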
> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>  		engine->emit_flush = gen12_emit_flush_xcs;
>  	}
>  	engine->set_default_submission = guc_set_default_submission;
> +	engine->busyness = guc_engine_busyness;
>  
> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>  
> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>  void intel_guc_submission_enable(struct intel_guc *guc)
>  {
>  	guc_init_lrc_mapping(guc);
> +	guc_init_engine_stats(guc);
>  }
>  
>  void intel_guc_submission_disable(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c7ef44fa0c36..5a95a9f0a8e3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>  				    struct i915_request *hung_rq,
>  				    struct drm_printer *m);
> +void intel_guc_busyness_park(struct intel_gt *gt);
> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>  
>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>  
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index a897f4abea0c..9aee08425382 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>  
> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> +
>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-05 23:14   ` Matthew Brost
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2021-10-05 23:14 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-gfx, dri-devel, john.c.harrison, Tvrtko Ursulin, daniel.vetter

On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
> With GuC handling scheduling, i915 is not aware of the time that a
> context is scheduled in and out of the engine. Since i915 pmu relies on
> this info to provide engine busyness to the user, GuC shares this info
> with i915 for all engines using shared memory. For each engine, this
> info contains:
> 
> - total busyness: total time that the context was running (total)
> - id: id of the running context (id)
> - start timestamp: timestamp when the context started running (start)
> 
> At the time (now) of sampling the engine busyness, if the id is valid
> (!= ~0), and start is non-zero, then the context is considered to be
> active and the engine busyness is calculated using the below equation
> 
> 	engine busyness = total + (now - start)
> 
> All times are obtained from the gt clock base. For inactive contexts,
> engine busyness is just equal to the total.
> 
> The start and total values provided by GuC are 32 bits and wrap around
> in a few minutes. Since perf pmu provides busyness as 64 bit
> monotonically increasing values, there is a need for this implementation
> to account for overflows and extend the time to 64 bits before returning
> busyness to the user. In order to do that, a worker runs periodically at
> frequency = 1/8th the time it takes for the timestamp to wrap. As an
> example, that would be once in 27 seconds for a gt clock frequency of
> 19.2 MHz.
> 
> Opens and wip that are targeted for later patches:
> 
> 1) On global gt reset the total busyness of engines resets and i915
>    needs to fix that so that user sees monotonically increasing
>    busyness.
> 2) In runtime suspend mode, the worker may not need to be run. We could
>    stop the worker on suspend and rerun it on resume provided that the
>    guc pm timestamp does not tick during suspend.
> 
> Note:
> There might be an overaccounting of busyness due to the fact that GuC
> may be updating the total and start values while kmd is reading them.
> (i.e kmd may read the updated total and the stale start). In such a
> case, user may see higher busyness value followed by smaller ones which
> would eventually catch up to the higher value.
> 
> v2: (Tvrtko)
> - Include details in commit message
> - Move intel engine busyness function into execlist code
> - Use union inside engine->stats
> - Use natural type for ping delay jiffies
> - Drop active_work condition checks
> - Use for_each_engine if iterating all engines
> - Drop seq locking, use spinlock at guc level to update engine stats
> - Document worker specific details
> 
> v3: (Tvrtko/Umesh)
> - Demarcate guc and execlist stat objects with comments
> - Document known over-accounting issue in commit
> - Provide a consistent view of guc state
> - Add hooks to gt park/unpark for guc busyness
> - Stop/start worker in gt park/unpark path
> - Drop inline
> - Move spinlock and worker inits to guc initialization
> - Drop helpers that are called only once
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>  12 files changed, 398 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..6fcc70a313d9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>  	intel_engine_print_breadcrumbs(engine, m);
>  }
>  
> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> -					    ktime_t *now)
> -{
> -	ktime_t total = engine->stats.total;
> -
> -	/*
> -	 * If the engine is executing something at the moment
> -	 * add it to the total.
> -	 */
> -	*now = ktime_get();
> -	if (READ_ONCE(engine->stats.active))
> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> -
> -	return total;
> -}
> -
>  /**
>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>   * @engine: engine to report on
> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>   */
>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>  {
> -	unsigned int seq;
> -	ktime_t total;
> -
> -	do {
> -		seq = read_seqcount_begin(&engine->stats.lock);
> -		total = __intel_engine_get_busy_time(engine, now);
> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> -
> -	return total;
> +	return engine->busyness(engine, now);
>  }
>  
>  struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..8e1b9c38a6fc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>  	void		(*add_active_request)(struct i915_request *rq);
>  	void		(*remove_active_request)(struct i915_request *rq);
>  
> +	/*
> +	 * Get engine busyness and the time at which the busyness was sampled.
> +	 */
> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> +				    ktime_t *now);
> +
>  	struct intel_engine_execlists execlists;
>  
>  	/*
> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>  
>  	struct {
> -		/**
> -		 * @active: Number of contexts currently scheduled in.
> -		 */
> -		unsigned int active;
> -
> -		/**
> -		 * @lock: Lock protecting the below fields.
> -		 */
> -		seqcount_t lock;
> -
> -		/**
> -		 * @total: Total time this engine was busy.
> -		 *
> -		 * Accumulated time not counting the most recent block in cases
> -		 * where engine is currently busy (active > 0).
> -		 */
> -		ktime_t total;
> -
> -		/**
> -		 * @start: Timestamp of the last idle to active transition.
> -		 *
> -		 * Idle is defined as active == 0, active is active > 0.
> -		 */
> -		ktime_t start;
> +		union {
> +			/* Fields used by the execlists backend. */
> +			struct {
> +				/**
> +				 * @active: Number of contexts currently
> +				 * scheduled in.
> +				 */
> +				unsigned int active;
> +
> +				/**
> +				 * @lock: Lock protecting the below fields.
> +				 */
> +				seqcount_t lock;
> +
> +				/**
> +				 * @total: Total time this engine was busy.
> +				 *
> +				 * Accumulated time not counting the most recent
> +				 * block in cases where engine is currently busy
> +				 * (active > 0).
> +				 */
> +				ktime_t total;
> +
> +				/**
> +				 * @start: Timestamp of the last idle to active
> +				 * transition.
> +				 *
> +				 * Idle is defined as active == 0, active is
> +				 * active > 0.
> +				 */
> +				ktime_t start;
> +			};

Not anonymous? e.g.

struct {
	...
} execlists;
struct {
	...
} guc;

IMO this is better as this is self documenting and if you touch an
backend specific field in a non-backend specific file it pops out as
incorrect.

> +
> +			/* Fields used by the GuC backend. */
> +			struct {
> +				/**
> +				 * @running: Active state of the engine when
> +				 * busyness was last sampled.
> +				 */
> +				bool running;
> +
> +				/**
> +				 * @prev_total: Previous value of total runtime
> +				 * clock cycles.
> +				 */
> +				u32 prev_total;
> +
> +				/**
> +				 * @total_gt_clks: Total gt clock cycles this
> +				 * engine was busy.
> +				 */
> +				u64 total_gt_clks;
> +
> +				/**
> +				 * @start_gt_clk: GT clock time of last idle to
> +				 * active transition.
> +				 */
> +				u64 start_gt_clk;
> +			};
> +		};
>  
>  		/**
>  		 * @rps: Utilisation at last RPS sampling.
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5c9b695e906c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>  	lrc_fini_wa_ctx(engine);
>  }
>  
> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> +					   ktime_t *now)
> +{
> +	ktime_t total = engine->stats.total;
> +
> +	/*
> +	 * If the engine is executing something at the moment
> +	 * add it to the total.
> +	 */
> +	*now = ktime_get();
> +	if (READ_ONCE(engine->stats.active))
> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> +
> +	return total;
> +}
> +
> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> +					 ktime_t *now)
> +{
> +	unsigned int seq;
> +	ktime_t total;
> +
> +	do {
> +		seq = read_seqcount_begin(&engine->stats.lock);
> +		total = __execlists_engine_busyness(engine, now);
> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> +
> +	return total;
> +}
> +
>  static void
>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>  {
> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>  		engine->emit_bb_start = gen8_emit_bb_start;
>  	else
>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> +
> +	engine->busyness = execlists_engine_busyness;
>  }
>  
>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 524eaf678790..b4a8594bc46c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>  	intel_rc6_unpark(&gt->rc6);
>  	intel_rps_unpark(&gt->rps);
>  	i915_pmu_gt_unparked(i915);
> +	intel_guc_busyness_unpark(gt);

I personally don't mind this but in the spirit of correct layering, this
likely should be generic wrapper inline func which calls a vfunc if
present (e.g. set the vfunc for backend, don't set for execlists).

>  
>  	intel_gt_unpark_requests(gt);
>  	runtime_begin(gt);
> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>  	runtime_end(gt);
>  	intel_gt_park_requests(gt);
>  
> +	intel_guc_busyness_park(gt);

Same here.

>  	i915_vma_parked(gt);
>  	i915_pmu_gt_parked(i915);
>  	intel_rps_park(&gt->rps);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ff1311d4beff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -143,6 +143,7 @@ enum intel_guc_action {
>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>  	INTEL_GUC_ACTION_LIMIT
>  };
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..22c30dbdf63a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -104,6 +104,8 @@ struct intel_guc {
>  	u32 ads_regset_size;
>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>  	u32 ads_golden_ctxt_size;
> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> +	u32 ads_engine_usage_size;
>  
>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>  	struct i915_vma *lrc_desc_pool;
> @@ -138,6 +140,30 @@ struct intel_guc {
>  
>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>  	struct mutex send_mutex;
> +
> +	struct {
> +		/**
> +		 * @lock: Lock protecting the below fields and the engine stats.
> +		 */
> +		spinlock_t lock;
> +

Again I really don't mind, but I'm told not to add more spinlocks than
needed. This really should be protected by a generic GuC submission
spinlock, e.g. build on this patch and protect all of this with the
submission_state.lock.

https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5

Whoever's series gets merged first can include the above patch.

The rest of the series looks fine cosmetically to me.

Matt

> +		/**
> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> +		 */
> +		u64 gt_stamp;
> +
> +		/**
> +		 * @ping_delay: Period for polling the GT timestamp for
> +		 * overflow.
> +		 */
> +		unsigned long ping_delay;
> +
> +		/**
> +		 * @work: Periodic work to adjust GT timestamp, engine and
> +		 * context usage for overflows.
> +		 */
> +		struct delayed_work work;
> +	} timestamp;
>  };
>  
>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..ca9ab53999d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -26,6 +26,8 @@
>   *      | guc_policies                          |
>   *      +---------------------------------------+
>   *      | guc_gt_system_info                    |
> + *      +---------------------------------------+
> + *      | guc_engine_usage                      |
>   *      +---------------------------------------+ <== static
>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>   *      | guc_mmio_reg[countB] (engine 0.1)     |
> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>  	struct guc_ads ads;
>  	struct guc_policies policies;
>  	struct guc_gt_system_info system_info;
> +	struct guc_engine_usage engine_usage;
>  	/* From here on, location is dynamic! Refer to above diagram. */
>  	struct guc_mmio_reg regset[0];
>  } __packed;
> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>  
>  	guc_ads_private_data_reset(guc);
>  }
> +
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> +{
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> +	u32 offset = base + ptr_offset(blob, engine_usage);
> +
> +	return offset;
> +}
> +
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> +{
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u8 guc_class = engine_class_to_guc_class(engine->class);
> +
> +	return &blob->engine_usage.engines[guc_class][engine->instance];
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> index 3d85051d57e4..e74c110facff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> @@ -6,8 +6,11 @@
>  #ifndef _INTEL_GUC_ADS_H_
>  #define _INTEL_GUC_ADS_H_
>  
> +#include <linux/types.h>
> +
>  struct intel_guc;
>  struct drm_printer;
> +struct intel_engine_cs;
>  
>  int intel_guc_ads_create(struct intel_guc *guc);
>  void intel_guc_ads_destroy(struct intel_guc *guc);
> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>  void intel_guc_ads_reset(struct intel_guc *guc);
>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>  				     struct drm_printer *p);
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>  
>  #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..7c9c081670fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -294,6 +294,19 @@ struct guc_ads {
>  	u32 reserved[15];
>  } __packed;
>  
> +/* Engine usage stats */
> +struct guc_engine_usage_record {
> +	u32 current_context_index;
> +	u32 last_switch_in_stamp;
> +	u32 reserved0;
> +	u32 total_runtime;
> +	u32 reserved1[4];
> +} __packed;
> +
> +struct guc_engine_usage {
> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> +} __packed;
> +
>  /* GuC logging structures */
>  
>  enum guc_log_buffer_type {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..3f7d0f2ac9da 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -12,6 +12,7 @@
>  #include "gt/intel_engine_pm.h"
>  #include "gt/intel_engine_heartbeat.h"
>  #include "gt/intel_gt.h"
> +#include "gt/intel_gt_clock_utils.h"
>  #include "gt/intel_gt_irq.h"
>  #include "gt/intel_gt_pm.h"
>  #include "gt/intel_gt_requests.h"
> @@ -20,6 +21,7 @@
>  #include "gt/intel_mocs.h"
>  #include "gt/intel_ring.h"
>  
> +#include "intel_guc_ads.h"
>  #include "intel_guc_submission.h"
>  
>  #include "i915_drv.h"
> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>  static void disable_submission(struct intel_guc *guc)
>  {
>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
>  
>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>  		GEM_BUG_ON(!guc->ct.enabled);
>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>  		sched_engine->tasklet.callback = NULL;
>  	}
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	for_each_engine(engine, gt, id)
> +		engine->stats.prev_total = 0;
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>  }
>  
>  static void enable_submission(struct intel_guc *guc)
> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>  }
>  
> +/*
> + * GuC stores busyness stats for each engine at context in/out boundaries. A
> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> + * GuC.
> + *
> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> + * active. For an active engine total busyness = total + (now - start), where
> + * 'now' is the time at which the busyness is sampled. For inactive engine,
> + * total busyness = total.
> + *
> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> + *
> + * The start and total values provided by GuC are 32 bits and wrap around in a
> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> + * increasing ns values, there is a need for this implementation to account for
> + * overflows and extend the GuC provided values to 64 bits before returning
> + * busyness to the user. In order to do that, a worker runs periodically at
> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> + * 27 seconds for a gt clock frequency of 19.2 MHz).
> + */
> +
> +#define WRAP_TIME_CLKS U32_MAX
> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> +
> +static void
> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> +{
> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> +
> +	if (new_start == lower_32_bits(*prev_start))
> +		return;
> +
> +	if (new_start < gt_stamp_last &&
> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> +		gt_stamp_hi++;
> +
> +	if (new_start > gt_stamp_last &&
> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> +		gt_stamp_hi--;
> +
> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> +}
> +
> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> +{
> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	u32 last_switch = rec->last_switch_in_stamp;
> +	u32 ctx_id = rec->current_context_index;
> +	u32 total = rec->total_runtime;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	engine->stats.running = ctx_id != ~0U && last_switch;
> +	if (engine->stats.running)
> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> +				     last_switch);
> +
> +	/*
> +	 * Instead of adjusting the total for overflow, just add the
> +	 * difference from previous sample to the stats.total_gt_clks
> +	 */
> +	if (total && total != ~0U) {
> +		engine->stats.total_gt_clks += (u32)(total -
> +						     engine->stats.prev_total);
> +		engine->stats.prev_total = total;
> +	}
> +}
> +
> +static void guc_update_pm_timestamp(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u32 gt_stamp_now, gt_stamp_hi;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> +
> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> +		gt_stamp_hi++;
> +
> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
> +}
> +
> +/*
> + * Unlike the execlist mode of submission total and active times are in terms of
> + * gt clocks. The *now parameter is retained to return the cpu time at which the
> + * busyness was sampled.
> + */
> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> +{
> +	struct intel_gt *gt = engine->gt;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	unsigned long flags;
> +	u64 total;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	*now = ktime_get();
> +
> +	/*
> +	 * The active busyness depends on start_gt_clk and gt_stamp.
> +	 * gt_stamp is updated by i915 only when gt is awake and the
> +	 * start_gt_clk is derived from GuC state. To get a consistent
> +	 * view of activity, we query the GuC state only if gt is awake.
> +	 */
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_engine_gt_clks(engine);
> +		guc_update_pm_timestamp(guc);
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> +	if (engine->stats.running) {
> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> +
> +		total += intel_gt_clock_interval_to_ns(gt, clk);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	return ns_to_ktime(total);
> +}
> +
> +static void __update_guc_busyness_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_pm_timestamp(guc);
> +
> +		for_each_engine(engine, gt, id)
> +			guc_update_engine_gt_clks(engine);
> +
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +}
> +
> +static void guc_timestamp_ping(struct work_struct *wrk)
> +{
> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> +					     timestamp.work.work);
> +
> +	__update_guc_busyness_stats(guc);
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> +{
> +	u32 offset = intel_guc_engine_usage_offset(guc);
> +	u32 action[] = {
> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> +		offset,
> +		0,
> +	};
> +
> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> +}
> +
> +static void guc_init_engine_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_wakeref_t wakeref;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +
> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> +		int ret = guc_action_enable_usage_stats(guc);
> +
> +		if (ret)
> +			drm_err(&gt->i915->drm,
> +				"Failed to enable usage stats: %d!\n", ret);
> +	}
> +}
> +
> +void intel_guc_busyness_park(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +	__update_guc_busyness_stats(guc);
> +}
> +
> +void intel_guc_busyness_unpark(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
>  /*
>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>   * at firmware loading time.
>   */
>  int intel_guc_submission_init(struct intel_guc *guc)
>  {
> +	struct intel_gt *gt = guc_to_gt(guc);
>  	int ret;
>  
>  	if (guc->lrc_desc_pool)
> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>  	INIT_LIST_HEAD(&guc->guc_id_list);
>  	ida_init(&guc->guc_ids);
>  
> +	spin_lock_init(&guc->timestamp.lock);
> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> +
>  	return 0;
>  }
>  
> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>  		engine->emit_flush = gen12_emit_flush_xcs;
>  	}
>  	engine->set_default_submission = guc_set_default_submission;
> +	engine->busyness = guc_engine_busyness;
>  
> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>  
> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>  void intel_guc_submission_enable(struct intel_guc *guc)
>  {
>  	guc_init_lrc_mapping(guc);
> +	guc_init_engine_stats(guc);
>  }
>  
>  void intel_guc_submission_disable(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c7ef44fa0c36..5a95a9f0a8e3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>  				    struct i915_request *hung_rq,
>  				    struct drm_printer *m);
> +void intel_guc_busyness_park(struct intel_gt *gt);
> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>  
>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>  
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index a897f4abea0c..9aee08425382 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>  
> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> +
>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> -- 
> 2.20.1
> 


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-05 23:14   ` [Intel-gfx] " Matthew Brost
@ 2021-10-06  8:22     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-06  8:22 UTC (permalink / raw)
  To: Matthew Brost, Umesh Nerlige Ramappa
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter


On 06/10/2021 00:14, Matthew Brost wrote:
> On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
>> With GuC handling scheduling, i915 is not aware of the time that a
>> context is scheduled in and out of the engine. Since i915 pmu relies on
>> this info to provide engine busyness to the user, GuC shares this info
>> with i915 for all engines using shared memory. For each engine, this
>> info contains:
>>
>> - total busyness: total time that the context was running (total)
>> - id: id of the running context (id)
>> - start timestamp: timestamp when the context started running (start)
>>
>> At the time (now) of sampling the engine busyness, if the id is valid
>> (!= ~0), and start is non-zero, then the context is considered to be
>> active and the engine busyness is calculated using the below equation
>>
>> 	engine busyness = total + (now - start)
>>
>> All times are obtained from the gt clock base. For inactive contexts,
>> engine busyness is just equal to the total.
>>
>> The start and total values provided by GuC are 32 bits and wrap around
>> in a few minutes. Since perf pmu provides busyness as 64 bit
>> monotonically increasing values, there is a need for this implementation
>> to account for overflows and extend the time to 64 bits before returning
>> busyness to the user. In order to do that, a worker runs periodically at
>> frequency = 1/8th the time it takes for the timestamp to wrap. As an
>> example, that would be once in 27 seconds for a gt clock frequency of
>> 19.2 MHz.
>>
>> Opens and wip that are targeted for later patches:
>>
>> 1) On global gt reset the total busyness of engines resets and i915
>>     needs to fix that so that user sees monotonically increasing
>>     busyness.
>> 2) In runtime suspend mode, the worker may not need to be run. We could
>>     stop the worker on suspend and rerun it on resume provided that the
>>     guc pm timestamp does not tick during suspend.
>>
>> Note:
>> There might be an overaccounting of busyness due to the fact that GuC
>> may be updating the total and start values while kmd is reading them.
>> (i.e kmd may read the updated total and the stale start). In such a
>> case, user may see higher busyness value followed by smaller ones which
>> would eventually catch up to the higher value.
>>
>> v2: (Tvrtko)
>> - Include details in commit message
>> - Move intel engine busyness function into execlist code
>> - Use union inside engine->stats
>> - Use natural type for ping delay jiffies
>> - Drop active_work condition checks
>> - Use for_each_engine if iterating all engines
>> - Drop seq locking, use spinlock at guc level to update engine stats
>> - Document worker specific details
>>
>> v3: (Tvrtko/Umesh)
>> - Demarcate guc and execlist stat objects with comments
>> - Document known over-accounting issue in commit
>> - Provide a consistent view of guc state
>> - Add hooks to gt park/unpark for guc busyness
>> - Stop/start worker in gt park/unpark path
>> - Drop inline
>> - Move spinlock and worker inits to guc initialization
>> - Drop helpers that are called only once
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>   drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>   12 files changed, 398 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 2ae57e4656a3..6fcc70a313d9 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>   	intel_engine_print_breadcrumbs(engine, m);
>>   }
>>   
>> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>> -					    ktime_t *now)
>> -{
>> -	ktime_t total = engine->stats.total;
>> -
>> -	/*
>> -	 * If the engine is executing something at the moment
>> -	 * add it to the total.
>> -	 */
>> -	*now = ktime_get();
>> -	if (READ_ONCE(engine->stats.active))
>> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> -
>> -	return total;
>> -}
>> -
>>   /**
>>    * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>    * @engine: engine to report on
>> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>    */
>>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>   {
>> -	unsigned int seq;
>> -	ktime_t total;
>> -
>> -	do {
>> -		seq = read_seqcount_begin(&engine->stats.lock);
>> -		total = __intel_engine_get_busy_time(engine, now);
>> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> -
>> -	return total;
>> +	return engine->busyness(engine, now);
>>   }
>>   
>>   struct intel_context *
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> index 5ae1207c363b..8e1b9c38a6fc 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>>   	void		(*add_active_request)(struct i915_request *rq);
>>   	void		(*remove_active_request)(struct i915_request *rq);
>>   
>> +	/*
>> +	 * Get engine busyness and the time at which the busyness was sampled.
>> +	 */
>> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>> +				    ktime_t *now);
>> +
>>   	struct intel_engine_execlists execlists;
>>   
>>   	/*
>> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>>   	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>   
>>   	struct {
>> -		/**
>> -		 * @active: Number of contexts currently scheduled in.
>> -		 */
>> -		unsigned int active;
>> -
>> -		/**
>> -		 * @lock: Lock protecting the below fields.
>> -		 */
>> -		seqcount_t lock;
>> -
>> -		/**
>> -		 * @total: Total time this engine was busy.
>> -		 *
>> -		 * Accumulated time not counting the most recent block in cases
>> -		 * where engine is currently busy (active > 0).
>> -		 */
>> -		ktime_t total;
>> -
>> -		/**
>> -		 * @start: Timestamp of the last idle to active transition.
>> -		 *
>> -		 * Idle is defined as active == 0, active is active > 0.
>> -		 */
>> -		ktime_t start;
>> +		union {
>> +			/* Fields used by the execlists backend. */
>> +			struct {
>> +				/**
>> +				 * @active: Number of contexts currently
>> +				 * scheduled in.
>> +				 */
>> +				unsigned int active;
>> +
>> +				/**
>> +				 * @lock: Lock protecting the below fields.
>> +				 */
>> +				seqcount_t lock;
>> +
>> +				/**
>> +				 * @total: Total time this engine was busy.
>> +				 *
>> +				 * Accumulated time not counting the most recent
>> +				 * block in cases where engine is currently busy
>> +				 * (active > 0).
>> +				 */
>> +				ktime_t total;
>> +
>> +				/**
>> +				 * @start: Timestamp of the last idle to active
>> +				 * transition.
>> +				 *
>> +				 * Idle is defined as active == 0, active is
>> +				 * active > 0.
>> +				 */
>> +				ktime_t start;
>> +			};
> 
> Not anonymous? e.g.
> 
> struct {
> 	...
> } execlists;
> struct {
> 	...
> } guc;
> 
> IMO this is better as this is self documenting and if you touch an
> backend specific field in a non-backend specific file it pops out as
> incorrect.
> 
>> +
>> +			/* Fields used by the GuC backend. */
>> +			struct {
>> +				/**
>> +				 * @running: Active state of the engine when
>> +				 * busyness was last sampled.
>> +				 */
>> +				bool running;
>> +
>> +				/**
>> +				 * @prev_total: Previous value of total runtime
>> +				 * clock cycles.
>> +				 */
>> +				u32 prev_total;
>> +
>> +				/**
>> +				 * @total_gt_clks: Total gt clock cycles this
>> +				 * engine was busy.
>> +				 */
>> +				u64 total_gt_clks;
>> +
>> +				/**
>> +				 * @start_gt_clk: GT clock time of last idle to
>> +				 * active transition.
>> +				 */
>> +				u64 start_gt_clk;
>> +			};
>> +		};
>>   
>>   		/**
>>   		 * @rps: Utilisation at last RPS sampling.
>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> index 7147fe80919e..5c9b695e906c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>   	lrc_fini_wa_ctx(engine);
>>   }
>>   
>> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					   ktime_t *now)
>> +{
>> +	ktime_t total = engine->stats.total;
>> +
>> +	/*
>> +	 * If the engine is executing something at the moment
>> +	 * add it to the total.
>> +	 */
>> +	*now = ktime_get();
>> +	if (READ_ONCE(engine->stats.active))
>> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> +
>> +	return total;
>> +}
>> +
>> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					 ktime_t *now)
>> +{
>> +	unsigned int seq;
>> +	ktime_t total;
>> +
>> +	do {
>> +		seq = read_seqcount_begin(&engine->stats.lock);
>> +		total = __execlists_engine_busyness(engine, now);
>> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> +
>> +	return total;
>> +}
>> +
>>   static void
>>   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>   {
>> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>   		engine->emit_bb_start = gen8_emit_bb_start;
>>   	else
>>   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>> +
>> +	engine->busyness = execlists_engine_busyness;
>>   }
>>   
>>   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index 524eaf678790..b4a8594bc46c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>   	intel_rc6_unpark(&gt->rc6);
>>   	intel_rps_unpark(&gt->rps);
>>   	i915_pmu_gt_unparked(i915);
>> +	intel_guc_busyness_unpark(gt);
> 
> I personally don't mind this, but in the spirit of correct layering this
> likely should be a generic inline wrapper function which calls a vfunc
> if present (e.g. set the vfunc for the GuC backend, don't set it for
> execlists).
> 
>>   
>>   	intel_gt_unpark_requests(gt);
>>   	runtime_begin(gt);
>> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>   	runtime_end(gt);
>>   	intel_gt_park_requests(gt);
>>   
>> +	intel_guc_busyness_park(gt);
> 
> Same here.
> 
>>   	i915_vma_parked(gt);
>>   	i915_pmu_gt_parked(i915);
>>   	intel_rps_park(&gt->rps);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index 8ff582222aff..ff1311d4beff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -143,6 +143,7 @@ enum intel_guc_action {
>>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>   	INTEL_GUC_ACTION_LIMIT
>>   };
>>   
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 5dd174babf7a..22c30dbdf63a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -104,6 +104,8 @@ struct intel_guc {
>>   	u32 ads_regset_size;
>>   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>   	u32 ads_golden_ctxt_size;
>> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
>> +	u32 ads_engine_usage_size;
>>   
>>   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>   	struct i915_vma *lrc_desc_pool;
>> @@ -138,6 +140,30 @@ struct intel_guc {
>>   
>>   	/** @send_mutex: used to serialize the intel_guc_send actions */
>>   	struct mutex send_mutex;
>> +
>> +	struct {
>> +		/**
>> +		 * @lock: Lock protecting the below fields and the engine stats.
>> +		 */
>> +		spinlock_t lock;
>> +
> 
> Again I really don't mind, but I'm told not to add more spinlocks than
> needed. This really should be protected by a generic GuC submission
> spinlock, e.g. build on this patch and protect all of this with the
> submission_state.lock.

I see no good reason to use the submission lock here. The two are 
completely different paths, with completely different entry points, and 
we don't want to introduce contention where it is trivially avoidable 
at no real cost. In other words, I think this lock is well defined and 
localised, both in code and in execution flows.

Regards,

Tvrtko

> 
> https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5
> 
> Whoever's series gets merged first can include the above patch.
> 
> The rest of the series looks fine cosmetically to me.
> 
> Matt
> 
>> +		/**
>> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>> +		 */
>> +		u64 gt_stamp;
>> +
>> +		/**
>> +		 * @ping_delay: Period for polling the GT timestamp for
>> +		 * overflow.
>> +		 */
>> +		unsigned long ping_delay;
>> +
>> +		/**
>> +		 * @work: Periodic work to adjust GT timestamp, engine and
>> +		 * context usage for overflows.
>> +		 */
>> +		struct delayed_work work;
>> +	} timestamp;
>>   };
>>   
>>   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> index 2c6ea64af7ec..ca9ab53999d5 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> @@ -26,6 +26,8 @@
>>    *      | guc_policies                          |
>>    *      +---------------------------------------+
>>    *      | guc_gt_system_info                    |
>> + *      +---------------------------------------+
>> + *      | guc_engine_usage                      |
>>    *      +---------------------------------------+ <== static
>>    *      | guc_mmio_reg[countA] (engine 0.0)     |
>>    *      | guc_mmio_reg[countB] (engine 0.1)     |
>> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>   	struct guc_ads ads;
>>   	struct guc_policies policies;
>>   	struct guc_gt_system_info system_info;
>> +	struct guc_engine_usage engine_usage;
>>   	/* From here on, location is dynamic! Refer to above diagram. */
>>   	struct guc_mmio_reg regset[0];
>>   } __packed;
>> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>   
>>   	guc_ads_private_data_reset(guc);
>>   }
>> +
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>> +{
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>> +	u32 offset = base + ptr_offset(blob, engine_usage);
>> +
>> +	return offset;
>> +}
>> +
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>> +{
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u8 guc_class = engine_class_to_guc_class(engine->class);
>> +
>> +	return &blob->engine_usage.engines[guc_class][engine->instance];
>> +}
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> index 3d85051d57e4..e74c110facff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> @@ -6,8 +6,11 @@
>>   #ifndef _INTEL_GUC_ADS_H_
>>   #define _INTEL_GUC_ADS_H_
>>   
>> +#include <linux/types.h>
>> +
>>   struct intel_guc;
>>   struct drm_printer;
>> +struct intel_engine_cs;
>>   
>>   int intel_guc_ads_create(struct intel_guc *guc);
>>   void intel_guc_ads_destroy(struct intel_guc *guc);
>> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>   void intel_guc_ads_reset(struct intel_guc *guc);
>>   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>   				     struct drm_printer *p);
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>   
>>   #endif
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index fa4be13c8854..7c9c081670fc 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -294,6 +294,19 @@ struct guc_ads {
>>   	u32 reserved[15];
>>   } __packed;
>>   
>> +/* Engine usage stats */
>> +struct guc_engine_usage_record {
>> +	u32 current_context_index;
>> +	u32 last_switch_in_stamp;
>> +	u32 reserved0;
>> +	u32 total_runtime;
>> +	u32 reserved1[4];
>> +} __packed;
>> +
>> +struct guc_engine_usage {
>> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>> +} __packed;
>> +
>>   /* GuC logging structures */
>>   
>>   enum guc_log_buffer_type {
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index ba0de35f6323..3f7d0f2ac9da 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -12,6 +12,7 @@
>>   #include "gt/intel_engine_pm.h"
>>   #include "gt/intel_engine_heartbeat.h"
>>   #include "gt/intel_gt.h"
>> +#include "gt/intel_gt_clock_utils.h"
>>   #include "gt/intel_gt_irq.h"
>>   #include "gt/intel_gt_pm.h"
>>   #include "gt/intel_gt_requests.h"
>> @@ -20,6 +21,7 @@
>>   #include "gt/intel_mocs.h"
>>   #include "gt/intel_ring.h"
>>   
>> +#include "intel_guc_ads.h"
>>   #include "intel_guc_submission.h"
>>   
>>   #include "i915_drv.h"
>> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>   static void disable_submission(struct intel_guc *guc)
>>   {
>>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>>   
>>   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>   		GEM_BUG_ON(!guc->ct.enabled);
>>   		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>   		sched_engine->tasklet.callback = NULL;
>>   	}
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.prev_total = 0;
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>   }
>>   
>>   static void enable_submission(struct intel_guc *guc)
>> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>   }
>>   
>> +/*
>> + * GuC stores busyness stats for each engine at context in/out boundaries. A
>> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>> + * GuC.
>> + *
>> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
>> + * active. For an active engine total busyness = total + (now - start), where
>> + * 'now' is the time at which the busyness is sampled. For inactive engine,
>> + * total busyness = total.
>> + *
>> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>> + *
>> + * The start and total values provided by GuC are 32 bits and wrap around in a
>> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>> + * increasing ns values, there is a need for this implementation to account for
>> + * overflows and extend the GuC provided values to 64 bits before returning
>> + * busyness to the user. In order to do that, a worker runs periodically at
>> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>> + * 27 seconds for a gt clock frequency of 19.2 MHz).
>> + */
>> +
>> +#define WRAP_TIME_CLKS U32_MAX
>> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>> +
>> +static void
>> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>> +{
>> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>> +
>> +	if (new_start == lower_32_bits(*prev_start))
>> +		return;
>> +
>> +	if (new_start < gt_stamp_last &&
>> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>> +		gt_stamp_hi++;
>> +
>> +	if (new_start > gt_stamp_last &&
>> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>> +		gt_stamp_hi--;
>> +
>> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>> +}
>> +
>> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>> +{
>> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	u32 last_switch = rec->last_switch_in_stamp;
>> +	u32 ctx_id = rec->current_context_index;
>> +	u32 total = rec->total_runtime;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	engine->stats.running = ctx_id != ~0U && last_switch;
>> +	if (engine->stats.running)
>> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>> +				     last_switch);
>> +
>> +	/*
>> +	 * Instead of adjusting the total for overflow, just add the
>> +	 * difference from previous sample to the stats.total_gt_clks
>> +	 */
>> +	if (total && total != ~0U) {
>> +		engine->stats.total_gt_clks += (u32)(total -
>> +						     engine->stats.prev_total);
>> +		engine->stats.prev_total = total;
>> +	}
>> +}
>> +
>> +static void guc_update_pm_timestamp(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	u32 gt_stamp_now, gt_stamp_hi;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>> +
>> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>> +		gt_stamp_hi++;
>> +
>> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>> +}
>> +
>> +/*
>> + * Unlike the execlist mode of submission total and active times are in terms of
>> + * gt clocks. The *now parameter is retained to return the cpu time at which the
>> + * busyness was sampled.
>> + */
>> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>> +{
>> +	struct intel_gt *gt = engine->gt;
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +	unsigned long flags;
>> +	u64 total;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	*now = ktime_get();
>> +
>> +	/*
>> +	 * The active busyness depends on start_gt_clk and gt_stamp.
>> +	 * gt_stamp is updated by i915 only when gt is awake and the
>> +	 * start_gt_clk is derived from GuC state. To get a consistent
>> +	 * view of activity, we query the GuC state only if gt is awake.
>> +	 */
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_engine_gt_clks(engine);
>> +		guc_update_pm_timestamp(guc);
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>> +	if (engine->stats.running) {
>> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>> +
>> +		total += intel_gt_clock_interval_to_ns(gt, clk);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +
>> +	return ns_to_ktime(total);
>> +}
>> +
>> +static void __update_guc_busyness_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_pm_timestamp(guc);
>> +
>> +		for_each_engine(engine, gt, id)
>> +			guc_update_engine_gt_clks(engine);
>> +
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>> +static void guc_timestamp_ping(struct work_struct *wrk)
>> +{
>> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>> +					     timestamp.work.work);
>> +
>> +	__update_guc_busyness_stats(guc);
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
>> +{
>> +	u32 offset = intel_guc_engine_usage_offset(guc);
>> +	u32 action[] = {
>> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>> +		offset,
>> +		0,
>> +	};
>> +
>> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>> +}
>> +
>> +static void guc_init_engine_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	intel_wakeref_t wakeref;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +
>> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>> +		int ret = guc_action_enable_usage_stats(guc);
>> +
>> +		if (ret)
>> +			drm_err(&gt->i915->drm,
>> +				"Failed to enable usage stats: %d!\n", ret);
>> +	}
>> +}
>> +
>> +void intel_guc_busyness_park(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +	__update_guc_busyness_stats(guc);
>> +}
>> +
>> +void intel_guc_busyness_unpark(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>>   /*
>>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>>    * at firmware loading time.
>>    */
>>   int intel_guc_submission_init(struct intel_guc *guc)
>>   {
>> +	struct intel_gt *gt = guc_to_gt(guc);
>>   	int ret;
>>   
>>   	if (guc->lrc_desc_pool)
>> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>   	INIT_LIST_HEAD(&guc->guc_id_list);
>>   	ida_init(&guc->guc_ids);
>>   
>> +	spin_lock_init(&guc->timestamp.lock);
>> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>> +
>>   	return 0;
>>   }
>>   
>> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>   		engine->emit_flush = gen12_emit_flush_xcs;
>>   	}
>>   	engine->set_default_submission = guc_set_default_submission;
>> +	engine->busyness = guc_engine_busyness;
>>   
>> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>   
>> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>   void intel_guc_submission_enable(struct intel_guc *guc)
>>   {
>>   	guc_init_lrc_mapping(guc);
>> +	guc_init_engine_stats(guc);
>>   }
>>   
>>   void intel_guc_submission_disable(struct intel_guc *guc)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> index c7ef44fa0c36..5a95a9f0a8e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>   				    struct i915_request *hung_rq,
>>   				    struct drm_printer *m);
>> +void intel_guc_busyness_park(struct intel_gt *gt);
>> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>>   
>>   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>   
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index a897f4abea0c..9aee08425382 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>   
>> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>> +
>>   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>> -- 
>> 2.20.1
>>


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-06  8:22     ` Tvrtko Ursulin
  0 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-06  8:22 UTC (permalink / raw)
  To: Matthew Brost, Umesh Nerlige Ramappa
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter


On 06/10/2021 00:14, Matthew Brost wrote:
> On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
>> With GuC handling scheduling, i915 is not aware of the time that a
>> context is scheduled in and out of the engine. Since i915 pmu relies on
>> this info to provide engine busyness to the user, GuC shares this info
>> with i915 for all engines using shared memory. For each engine, this
>> info contains:
>>
>> - total busyness: total time that the context was running (total)
>> - id: id of the running context (id)
>> - start timestamp: timestamp when the context started running (start)
>>
>> At the time (now) of sampling the engine busyness, if the id is valid
>> (!= ~0), and start is non-zero, then the context is considered to be
>> active and the engine busyness is calculated using the below equation
>>
>> 	engine busyness = total + (now - start)
>>
>> All times are obtained from the gt clock base. For inactive contexts,
>> engine busyness is just equal to the total.
>>
>> The start and total values provided by GuC are 32 bits and wrap around
>> in a few minutes. Since perf pmu provides busyness as 64 bit
>> monotonically increasing values, there is a need for this implementation
>> to account for overflows and extend the time to 64 bits before returning
>> busyness to the user. In order to do that, a worker runs periodically at
>> frequency = 1/8th the time it takes for the timestamp to wrap. As an
>> example, that would be once in 27 seconds for a gt clock frequency of
>> 19.2 MHz.
>>
>> Opens and wip that are targeted for later patches:
>>
>> 1) On global gt reset the total busyness of engines resets and i915
>>     needs to fix that so that user sees monotonically increasing
>>     busyness.
>> 2) In runtime suspend mode, the worker may not need to be run. We could
>>     stop the worker on suspend and rerun it on resume provided that the
>>     guc pm timestamp does not tick during suspend.
>>
>> Note:
>> There might be an overaccounting of busyness due to the fact that GuC
>> may be updating the total and start values while kmd is reading them.
>> (i.e kmd may read the updated total and the stale start). In such a
>> case, user may see higher busyness value followed by smaller ones which
>> would eventually catch up to the higher value.
>>
>> v2: (Tvrtko)
>> - Include details in commit message
>> - Move intel engine busyness function into execlist code
>> - Use union inside engine->stats
>> - Use natural type for ping delay jiffies
>> - Drop active_work condition checks
>> - Use for_each_engine if iterating all engines
>> - Drop seq locking, use spinlock at guc level to update engine stats
>> - Document worker specific details
>>
>> v3: (Tvrtko/Umesh)
>> - Demarcate guc and execlist stat objects with comments
>> - Document known over-accounting issue in commit
>> - Provide a consistent view of guc state
>> - Add hooks to gt park/unpark for guc busyness
>> - Stop/start worker in gt park/unpark path
>> - Drop inline
>> - Move spinlock and worker inits to guc initialization
>> - Drop helpers that are called only once
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>   drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>   12 files changed, 398 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 2ae57e4656a3..6fcc70a313d9 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>   	intel_engine_print_breadcrumbs(engine, m);
>>   }
>>   
>> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>> -					    ktime_t *now)
>> -{
>> -	ktime_t total = engine->stats.total;
>> -
>> -	/*
>> -	 * If the engine is executing something at the moment
>> -	 * add it to the total.
>> -	 */
>> -	*now = ktime_get();
>> -	if (READ_ONCE(engine->stats.active))
>> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> -
>> -	return total;
>> -}
>> -
>>   /**
>>    * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>    * @engine: engine to report on
>> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>    */
>>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>   {
>> -	unsigned int seq;
>> -	ktime_t total;
>> -
>> -	do {
>> -		seq = read_seqcount_begin(&engine->stats.lock);
>> -		total = __intel_engine_get_busy_time(engine, now);
>> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> -
>> -	return total;
>> +	return engine->busyness(engine, now);
>>   }
>>   
>>   struct intel_context *
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> index 5ae1207c363b..8e1b9c38a6fc 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>>   	void		(*add_active_request)(struct i915_request *rq);
>>   	void		(*remove_active_request)(struct i915_request *rq);
>>   
>> +	/*
>> +	 * Get engine busyness and the time at which the busyness was sampled.
>> +	 */
>> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>> +				    ktime_t *now);
>> +
>>   	struct intel_engine_execlists execlists;
>>   
>>   	/*
>> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>>   	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>   
>>   	struct {
>> -		/**
>> -		 * @active: Number of contexts currently scheduled in.
>> -		 */
>> -		unsigned int active;
>> -
>> -		/**
>> -		 * @lock: Lock protecting the below fields.
>> -		 */
>> -		seqcount_t lock;
>> -
>> -		/**
>> -		 * @total: Total time this engine was busy.
>> -		 *
>> -		 * Accumulated time not counting the most recent block in cases
>> -		 * where engine is currently busy (active > 0).
>> -		 */
>> -		ktime_t total;
>> -
>> -		/**
>> -		 * @start: Timestamp of the last idle to active transition.
>> -		 *
>> -		 * Idle is defined as active == 0, active is active > 0.
>> -		 */
>> -		ktime_t start;
>> +		union {
>> +			/* Fields used by the execlists backend. */
>> +			struct {
>> +				/**
>> +				 * @active: Number of contexts currently
>> +				 * scheduled in.
>> +				 */
>> +				unsigned int active;
>> +
>> +				/**
>> +				 * @lock: Lock protecting the below fields.
>> +				 */
>> +				seqcount_t lock;
>> +
>> +				/**
>> +				 * @total: Total time this engine was busy.
>> +				 *
>> +				 * Accumulated time not counting the most recent
>> +				 * block in cases where engine is currently busy
>> +				 * (active > 0).
>> +				 */
>> +				ktime_t total;
>> +
>> +				/**
>> +				 * @start: Timestamp of the last idle to active
>> +				 * transition.
>> +				 *
>> +				 * Idle is defined as active == 0, active is
>> +				 * active > 0.
>> +				 */
>> +				ktime_t start;
>> +			};
> 
> Not anonymous? e.g.
> 
> struct {
> 	...
> } execlists;
> struct {
> 	...
> } guc;
> 
> IMO this is better as it is self-documenting, and if you touch a
> backend-specific field in a non-backend-specific file it pops out as
> incorrect.
> 
>> +
>> +			/* Fields used by the GuC backend. */
>> +			struct {
>> +				/**
>> +				 * @running: Active state of the engine when
>> +				 * busyness was last sampled.
>> +				 */
>> +				bool running;
>> +
>> +				/**
>> +				 * @prev_total: Previous value of total runtime
>> +				 * clock cycles.
>> +				 */
>> +				u32 prev_total;
>> +
>> +				/**
>> +				 * @total_gt_clks: Total gt clock cycles this
>> +				 * engine was busy.
>> +				 */
>> +				u64 total_gt_clks;
>> +
>> +				/**
>> +				 * @start_gt_clk: GT clock time of last idle to
>> +				 * active transition.
>> +				 */
>> +				u64 start_gt_clk;
>> +			};
>> +		};
>>   
>>   		/**
>>   		 * @rps: Utilisation at last RPS sampling.
>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> index 7147fe80919e..5c9b695e906c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>   	lrc_fini_wa_ctx(engine);
>>   }
>>   
>> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					   ktime_t *now)
>> +{
>> +	ktime_t total = engine->stats.total;
>> +
>> +	/*
>> +	 * If the engine is executing something at the moment
>> +	 * add it to the total.
>> +	 */
>> +	*now = ktime_get();
>> +	if (READ_ONCE(engine->stats.active))
>> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> +
>> +	return total;
>> +}
>> +
>> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					 ktime_t *now)
>> +{
>> +	unsigned int seq;
>> +	ktime_t total;
>> +
>> +	do {
>> +		seq = read_seqcount_begin(&engine->stats.lock);
>> +		total = __execlists_engine_busyness(engine, now);
>> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> +
>> +	return total;
>> +}
>> +
>>   static void
>>   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>   {
>> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>   		engine->emit_bb_start = gen8_emit_bb_start;
>>   	else
>>   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>> +
>> +	engine->busyness = execlists_engine_busyness;
>>   }
>>   
>>   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index 524eaf678790..b4a8594bc46c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>   	intel_rc6_unpark(&gt->rc6);
>>   	intel_rps_unpark(&gt->rps);
>>   	i915_pmu_gt_unparked(i915);
>> +	intel_guc_busyness_unpark(gt);
> 
> I personally don't mind, but in the spirit of correct layering this
> likely should be a generic wrapper inline func which calls a vfunc if
> present (e.g. set the vfunc for the GuC backend, don't set it for execlists).
> 
>>   
>>   	intel_gt_unpark_requests(gt);
>>   	runtime_begin(gt);
>> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>   	runtime_end(gt);
>>   	intel_gt_park_requests(gt);
>>   
>> +	intel_guc_busyness_park(gt);
> 
> Same here.
> 
>>   	i915_vma_parked(gt);
>>   	i915_pmu_gt_parked(i915);
>>   	intel_rps_park(&gt->rps);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index 8ff582222aff..ff1311d4beff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -143,6 +143,7 @@ enum intel_guc_action {
>>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>   	INTEL_GUC_ACTION_LIMIT
>>   };
>>   
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 5dd174babf7a..22c30dbdf63a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -104,6 +104,8 @@ struct intel_guc {
>>   	u32 ads_regset_size;
>>   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>   	u32 ads_golden_ctxt_size;
>> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
>> +	u32 ads_engine_usage_size;
>>   
>>   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>   	struct i915_vma *lrc_desc_pool;
>> @@ -138,6 +140,30 @@ struct intel_guc {
>>   
>> +
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.prev_total = 0;
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>   }
>>   
>>   static void enable_submission(struct intel_guc *guc)
>> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>   }
>>   
>> +/*
>> + * GuC stores busyness stats for each engine at context in/out boundaries. A
>> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>> + * GuC.
>> + *
>> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
>> + * active. For an active engine total busyness = total + (now - start), where
>> + * 'now' is the time at which the busyness is sampled. For inactive engine,
>> + * total busyness = total.
>> + *
>> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>> + *
>> + * The start and total values provided by GuC are 32 bits and wrap around in a
>> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>> + * increasing ns values, there is a need for this implementation to account for
>> + * overflows and extend the GuC provided values to 64 bits before returning
>> + * busyness to the user. In order to do that, a worker runs periodically at
>> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>> + * 27 seconds for a gt clock frequency of 19.2 MHz).
>> + */
>> +
>> +#define WRAP_TIME_CLKS U32_MAX
>> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>> +
>> +static void
>> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>> +{
>> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>> +
>> +	if (new_start == lower_32_bits(*prev_start))
>> +		return;
>> +
>> +	if (new_start < gt_stamp_last &&
>> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>> +		gt_stamp_hi++;
>> +
>> +	if (new_start > gt_stamp_last &&
>> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>> +		gt_stamp_hi--;
>> +
>> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>> +}
>> +
>> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>> +{
>> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	u32 last_switch = rec->last_switch_in_stamp;
>> +	u32 ctx_id = rec->current_context_index;
>> +	u32 total = rec->total_runtime;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	engine->stats.running = ctx_id != ~0U && last_switch;
>> +	if (engine->stats.running)
>> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>> +				     last_switch);
>> +
>> +	/*
>> +	 * Instead of adjusting the total for overflow, just add the
>> +	 * difference from previous sample to the stats.total_gt_clks
>> +	 */
>> +	if (total && total != ~0U) {
>> +		engine->stats.total_gt_clks += (u32)(total -
>> +						     engine->stats.prev_total);
>> +		engine->stats.prev_total = total;
>> +	}
>> +}
>> +
>> +static void guc_update_pm_timestamp(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	u32 gt_stamp_now, gt_stamp_hi;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>> +
>> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>> +		gt_stamp_hi++;
>> +
>> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>> +}
>> +
>> +/*
>> + * Unlike the execlist mode of submission total and active times are in terms of
>> + * gt clocks. The *now parameter is retained to return the cpu time at which the
>> + * busyness was sampled.
>> + */
>> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>> +{
>> +	struct intel_gt *gt = engine->gt;
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +	unsigned long flags;
>> +	u64 total;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	*now = ktime_get();
>> +
>> +	/*
>> +	 * The active busyness depends on start_gt_clk and gt_stamp.
>> +	 * gt_stamp is updated by i915 only when gt is awake and the
>> +	 * start_gt_clk is derived from GuC state. To get a consistent
>> +	 * view of activity, we query the GuC state only if gt is awake.
>> +	 */
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_engine_gt_clks(engine);
>> +		guc_update_pm_timestamp(guc);
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>> +	if (engine->stats.running) {
>> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>> +
>> +		total += intel_gt_clock_interval_to_ns(gt, clk);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +
>> +	return ns_to_ktime(total);
>> +}
>> +
>> +static void __update_guc_busyness_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_pm_timestamp(guc);
>> +
>> +		for_each_engine(engine, gt, id)
>> +			guc_update_engine_gt_clks(engine);
>> +
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>> +static void guc_timestamp_ping(struct work_struct *wrk)
>> +{
>> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>> +					     timestamp.work.work);
>> +
>> +	__update_guc_busyness_stats(guc);
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
>> +{
>> +	u32 offset = intel_guc_engine_usage_offset(guc);
>> +	u32 action[] = {
>> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>> +		offset,
>> +		0,
>> +	};
>> +
>> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>> +}
>> +
>> +static void guc_init_engine_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	intel_wakeref_t wakeref;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +
>> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>> +		int ret = guc_action_enable_usage_stats(guc);
>> +
>> +		if (ret)
>> +			drm_err(&gt->i915->drm,
>> +				"Failed to enable usage stats: %d!\n", ret);
>> +	}
>> +}
>> +
>> +void intel_guc_busyness_park(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +	__update_guc_busyness_stats(guc);
>> +}
>> +
>> +void intel_guc_busyness_unpark(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>>   /*
>>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>>    * at firmware loading time.
>>    */
>>   int intel_guc_submission_init(struct intel_guc *guc)
>>   {
>> +	struct intel_gt *gt = guc_to_gt(guc);
>>   	int ret;
>>   
>>   	if (guc->lrc_desc_pool)
>> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>   	INIT_LIST_HEAD(&guc->guc_id_list);
>>   	ida_init(&guc->guc_ids);
>>   
>> +	spin_lock_init(&guc->timestamp.lock);
>> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>> +
>>   	return 0;
>>   }
>>   
>> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>   		engine->emit_flush = gen12_emit_flush_xcs;
>>   	}
>>   	engine->set_default_submission = guc_set_default_submission;
>> +	engine->busyness = guc_engine_busyness;
>>   
>> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>   
>> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>   void intel_guc_submission_enable(struct intel_guc *guc)
>>   {
>>   	guc_init_lrc_mapping(guc);
>> +	guc_init_engine_stats(guc);
>>   }
>>   
>>   void intel_guc_submission_disable(struct intel_guc *guc)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> index c7ef44fa0c36..5a95a9f0a8e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>   				    struct i915_request *hung_rq,
>>   				    struct drm_printer *m);
>> +void intel_guc_busyness_park(struct intel_gt *gt);
>> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>>   
>>   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>   
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index a897f4abea0c..9aee08425382 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>   
>> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>> +
>>   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>> -- 
>> 2.20.1
>>


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
@ 2021-10-06  9:11   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-06  9:11 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa, intel-gfx, dri-devel
  Cc: john.c.harrison, daniel.vetter, Matthew Brost


On 05/10/2021 18:47, Umesh Nerlige Ramappa wrote:
> With GuC handling scheduling, i915 is not aware of the time that a
> context is scheduled in and out of the engine. Since i915 pmu relies on
> this info to provide engine busyness to the user, GuC shares this info
> with i915 for all engines using shared memory. For each engine, this
> info contains:
> 
> - total busyness: total time that the context was running (total)
> - id: id of the running context (id)
> - start timestamp: timestamp when the context started running (start)
> 
> At the time (now) of sampling the engine busyness, if the id is valid
> (!= ~0), and start is non-zero, then the context is considered to be
> active and the engine busyness is calculated using the below equation
> 
> 	engine busyness = total + (now - start)
> 
> All times are obtained from the gt clock base. For inactive contexts,
> engine busyness is just equal to the total.
> 
> The start and total values provided by GuC are 32 bits and wrap around
> in a few minutes. Since perf pmu provides busyness as 64 bit
> monotonically increasing values, there is a need for this implementation
> to account for overflows and extend the time to 64 bits before returning
> busyness to the user. In order to do that, a worker runs periodically at
> frequency = 1/8th the time it takes for the timestamp to wrap. As an
> example, that would be once in 27 seconds for a gt clock frequency of
> 19.2 MHz.
> 
> Opens and wip that are targeted for later patches:
> 
> 1) On global gt reset the total busyness of engines resets and i915
>     needs to fix that so that user sees monotonically increasing
>     busyness.
> 2) In runtime suspend mode, the worker may not need to be run. We could
>     stop the worker on suspend and rerun it on resume provided that the
>     guc pm timestamp does not tick during suspend.

Second point has now been addressed, right?

> 
> Note:
> There might be an overaccounting of busyness due to the fact that GuC
> may be updating the total and start values while kmd is reading them.
> (i.e kmd may read the updated total and the stale start). In such a
> case, user may see higher busyness value followed by smaller ones which
> would eventually catch up to the higher value.
> 
> v2: (Tvrtko)
> - Include details in commit message
> - Move intel engine busyness function into execlist code
> - Use union inside engine->stats
> - Use natural type for ping delay jiffies
> - Drop active_work condition checks
> - Use for_each_engine if iterating all engines
> - Drop seq locking, use spinlock at guc level to update engine stats
> - Document worker specific details
> 
> v3: (Tvrtko/Umesh)
> - Demarcate guc and execlist stat objects with comments
> - Document known over-accounting issue in commit
> - Provide a consistent view of guc state
> - Add hooks to gt park/unpark for guc busyness
> - Stop/start worker in gt park/unpark path
> - Drop inline
> - Move spinlock and worker inits to guc initialization
> - Drop helpers that are called only once
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>   drivers/gpu/drm/i915/i915_reg.h               |   2 +
>   12 files changed, 398 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..6fcc70a313d9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	intel_engine_print_breadcrumbs(engine, m);
>   }
>   
> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> -					    ktime_t *now)
> -{
> -	ktime_t total = engine->stats.total;
> -
> -	/*
> -	 * If the engine is executing something at the moment
> -	 * add it to the total.
> -	 */
> -	*now = ktime_get();
> -	if (READ_ONCE(engine->stats.active))
> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> -
> -	return total;
> -}
> -
>   /**
>    * intel_engine_get_busy_time() - Return current accumulated engine busyness
>    * @engine: engine to report on
> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>    */
>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>   {
> -	unsigned int seq;
> -	ktime_t total;
> -
> -	do {
> -		seq = read_seqcount_begin(&engine->stats.lock);
> -		total = __intel_engine_get_busy_time(engine, now);
> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> -
> -	return total;
> +	return engine->busyness(engine, now);
>   }
>   
>   struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..8e1b9c38a6fc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>   	void		(*add_active_request)(struct i915_request *rq);
>   	void		(*remove_active_request)(struct i915_request *rq);
>   
> +	/*
> +	 * Get engine busyness and the time at which the busyness was sampled.
> +	 */
> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> +				    ktime_t *now);
> +
>   	struct intel_engine_execlists execlists;
>   
>   	/*
> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>   	u32 (*get_cmd_length_mask)(u32 cmd_header);
>   
>   	struct {
> -		/**
> -		 * @active: Number of contexts currently scheduled in.
> -		 */
> -		unsigned int active;
> -
> -		/**
> -		 * @lock: Lock protecting the below fields.
> -		 */
> -		seqcount_t lock;
> -
> -		/**
> -		 * @total: Total time this engine was busy.
> -		 *
> -		 * Accumulated time not counting the most recent block in cases
> -		 * where engine is currently busy (active > 0).
> -		 */
> -		ktime_t total;
> -
> -		/**
> -		 * @start: Timestamp of the last idle to active transition.
> -		 *
> -		 * Idle is defined as active == 0, active is active > 0.
> -		 */
> -		ktime_t start;
> +		union {
> +			/* Fields used by the execlists backend. */
> +			struct {
> +				/**
> +				 * @active: Number of contexts currently
> +				 * scheduled in.
> +				 */
> +				unsigned int active;
> +
> +				/**
> +				 * @lock: Lock protecting the below fields.
> +				 */
> +				seqcount_t lock;
> +
> +				/**
> +				 * @total: Total time this engine was busy.
> +				 *
> +				 * Accumulated time not counting the most recent
> +				 * block in cases where engine is currently busy
> +				 * (active > 0).
> +				 */
> +				ktime_t total;
> +
> +				/**
> +				 * @start: Timestamp of the last idle to active
> +				 * transition.
> +				 *
> +				 * Idle is defined as active == 0, active is
> +				 * active > 0.
> +				 */
> +				ktime_t start;
> +			};
> +
> +			/* Fields used by the GuC backend. */
> +			struct {
> +				/**
> +				 * @running: Active state of the engine when
> +				 * busyness was last sampled.
> +				 */
> +				bool running;
> +
> +				/**
> +				 * @prev_total: Previous value of total runtime
> +				 * clock cycles.
> +				 */
> +				u32 prev_total;
> +
> +				/**
> +				 * @total_gt_clks: Total gt clock cycles this
> +				 * engine was busy.
> +				 */
> +				u64 total_gt_clks;
> +
> +				/**
> +				 * @start_gt_clk: GT clock time of last idle to
> +				 * active transition.
> +				 */
> +				u64 start_gt_clk;
> +			};
> +		};
>   
>   		/**
>   		 * @rps: Utilisation at last RPS sampling.
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5c9b695e906c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>   	lrc_fini_wa_ctx(engine);
>   }
>   
> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> +					   ktime_t *now)
> +{
> +	ktime_t total = engine->stats.total;
> +
> +	/*
> +	 * If the engine is executing something at the moment
> +	 * add it to the total.
> +	 */
> +	*now = ktime_get();
> +	if (READ_ONCE(engine->stats.active))
> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> +
> +	return total;
> +}
> +
> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> +					 ktime_t *now)
> +{
> +	unsigned int seq;
> +	ktime_t total;
> +
> +	do {
> +		seq = read_seqcount_begin(&engine->stats.lock);
> +		total = __execlists_engine_busyness(engine, now);
> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> +
> +	return total;
> +}
> +
>   static void
>   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   {
> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_bb_start = gen8_emit_bb_start;
>   	else
>   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> +
> +	engine->busyness = execlists_engine_busyness;
>   }
>   
>   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 524eaf678790..b4a8594bc46c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>   	intel_rc6_unpark(&gt->rc6);
>   	intel_rps_unpark(&gt->rps);
>   	i915_pmu_gt_unparked(i915);
> +	intel_guc_busyness_unpark(gt);
>   
>   	intel_gt_unpark_requests(gt);
>   	runtime_begin(gt);
> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>   	runtime_end(gt);
>   	intel_gt_park_requests(gt);
>   
> +	intel_guc_busyness_park(gt);
>   	i915_vma_parked(gt);
>   	i915_pmu_gt_parked(i915);
>   	intel_rps_park(&gt->rps);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ff1311d4beff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -143,6 +143,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>   	INTEL_GUC_ACTION_LIMIT
>   };
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..22c30dbdf63a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -104,6 +104,8 @@ struct intel_guc {
>   	u32 ads_regset_size;
>   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>   	u32 ads_golden_ctxt_size;
> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> +	u32 ads_engine_usage_size;
>   
>   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>   	struct i915_vma *lrc_desc_pool;
> @@ -138,6 +140,30 @@ struct intel_guc {
>   
>   	/** @send_mutex: used to serialize the intel_guc_send actions */
>   	struct mutex send_mutex;
> +
> +	struct {
> +		/**
> +		 * @lock: Lock protecting the below fields and the engine stats.
> +		 */
> +		spinlock_t lock;
> +
> +		/**
> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> +		 */
> +		u64 gt_stamp;
> +
> +		/**
> +		 * @ping_delay: Period for polling the GT timestamp for
> +		 * overflow.
> +		 */
> +		unsigned long ping_delay;
> +
> +		/**
> +		 * @work: Periodic work to adjust GT timestamp, engine and
> +		 * context usage for overflows.
> +		 */
> +		struct delayed_work work;
> +	} timestamp;
>   };
>   
>   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..ca9ab53999d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -26,6 +26,8 @@
>    *      | guc_policies                          |
>    *      +---------------------------------------+
>    *      | guc_gt_system_info                    |
> + *      +---------------------------------------+
> + *      | guc_engine_usage                      |
>    *      +---------------------------------------+ <== static
>    *      | guc_mmio_reg[countA] (engine 0.0)     |
>    *      | guc_mmio_reg[countB] (engine 0.1)     |
> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>   	struct guc_ads ads;
>   	struct guc_policies policies;
>   	struct guc_gt_system_info system_info;
> +	struct guc_engine_usage engine_usage;
>   	/* From here on, location is dynamic! Refer to above diagram. */
>   	struct guc_mmio_reg regset[0];
>   } __packed;
> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>   
>   	guc_ads_private_data_reset(guc);
>   }
> +
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> +{
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> +	u32 offset = base + ptr_offset(blob, engine_usage);
> +
> +	return offset;
> +}
> +
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> +{
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u8 guc_class = engine_class_to_guc_class(engine->class);
> +
> +	return &blob->engine_usage.engines[guc_class][engine->instance];
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> index 3d85051d57e4..e74c110facff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> @@ -6,8 +6,11 @@
>   #ifndef _INTEL_GUC_ADS_H_
>   #define _INTEL_GUC_ADS_H_
>   
> +#include <linux/types.h>
> +
>   struct intel_guc;
>   struct drm_printer;
> +struct intel_engine_cs;
>   
>   int intel_guc_ads_create(struct intel_guc *guc);
>   void intel_guc_ads_destroy(struct intel_guc *guc);
> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>   void intel_guc_ads_reset(struct intel_guc *guc);
>   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>   				     struct drm_printer *p);
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>   
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..7c9c081670fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -294,6 +294,19 @@ struct guc_ads {
>   	u32 reserved[15];
>   } __packed;
>   
> +/* Engine usage stats */
> +struct guc_engine_usage_record {
> +	u32 current_context_index;
> +	u32 last_switch_in_stamp;
> +	u32 reserved0;
> +	u32 total_runtime;
> +	u32 reserved1[4];
> +} __packed;
> +
> +struct guc_engine_usage {
> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> +} __packed;
> +
>   /* GuC logging structures */
>   
>   enum guc_log_buffer_type {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..3f7d0f2ac9da 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -12,6 +12,7 @@
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
>   #include "gt/intel_gt.h"
> +#include "gt/intel_gt_clock_utils.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
>   #include "gt/intel_gt_requests.h"
> @@ -20,6 +21,7 @@
>   #include "gt/intel_mocs.h"
>   #include "gt/intel_ring.h"
>   
> +#include "intel_guc_ads.h"
>   #include "intel_guc_submission.h"
>   
>   #include "i915_drv.h"
> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>   static void disable_submission(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
>   
>   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>   		GEM_BUG_ON(!guc->ct.enabled);
>   		__tasklet_disable_sync_once(&sched_engine->tasklet);
>   		sched_engine->tasklet.callback = NULL;
>   	}
> +
> +	cancel_delayed_work(&guc->timestamp.work);

I am not sure when disable_submission gets called, so a question: could it 
be important to call cancel_delayed_work_sync here, to ensure that, if 
the worker was running, it has exited before proceeding?

Also, does this interact with the open issue about resets? Should/could 
the parking helper be called from here?

> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	for_each_engine(engine, gt, id)
> +		engine->stats.prev_total = 0;
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>   }
>   
>   static void enable_submission(struct intel_guc *guc)
> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +/*
> + * GuC stores busyness stats for each engine at context in/out boundaries. A
> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> + * GuC.
> + *
> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> + * active. For an active engine total busyness = total + (now - start), where
> + * 'now' is the time at which the busyness is sampled. For inactive engine,
> + * total busyness = total.
> + *
> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> + *
> + * The start and total values provided by GuC are 32 bits and wrap around in a
> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> + * increasing ns values, there is a need for this implementation to account for
> + * overflows and extend the GuC provided values to 64 bits before returning
> + * busyness to the user. In order to do that, a worker runs periodically at
> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> + * 27 seconds for a gt clock frequency of 19.2 MHz).
> + */
> +
> +#define WRAP_TIME_CLKS U32_MAX
> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> +
> +static void
> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> +{
> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> +
> +	if (new_start == lower_32_bits(*prev_start))
> +		return;
> +
> +	if (new_start < gt_stamp_last &&
> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> +		gt_stamp_hi++;
> +
> +	if (new_start > gt_stamp_last &&
> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> +		gt_stamp_hi--;
> +
> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> +}
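As a sanity check on the wrap handling above, here is a standalone userspace model of the same extension logic (the function and macro names mirror the patch, but this is an illustrative sketch, not driver code; the early-out on an unchanged prev_start is omitted). The key point is that unsigned 32-bit subtraction wraps modulo 2^32, so it directly yields the short forward or backward distance the two conditions test:

```c
#include <assert.h>
#include <stdint.h>

#define POLL_TIME_CLKS (UINT32_MAX >> 3)

/*
 * Model of __extend_last_switch(): extend a 32-bit context start
 * timestamp to 64 bits, using the already-extended gt_stamp as the
 * reference epoch.
 */
static uint64_t extend_last_switch(uint64_t gt_stamp, uint32_t new_start)
{
	uint32_t hi = (uint32_t)(gt_stamp >> 32);
	uint32_t last = (uint32_t)gt_stamp;

	/* new_start wrapped past zero after gt_stamp was sampled. */
	if (new_start < last && (uint32_t)(new_start - last) <= POLL_TIME_CLKS)
		hi++;

	/* new_start belongs to the epoch before gt_stamp wrapped. */
	if (new_start > last && (uint32_t)(last - new_start) <= POLL_TIME_CLKS && hi)
		hi--;

	return ((uint64_t)hi << 32) | new_start;
}
```

For example, with gt_stamp just past a wrap (hi already bumped) and new_start just before it, the second condition pulls hi back down so the start extends into the previous epoch.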
> +
> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> +{
> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	u32 last_switch = rec->last_switch_in_stamp;
> +	u32 ctx_id = rec->current_context_index;
> +	u32 total = rec->total_runtime;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	engine->stats.running = ctx_id != ~0U && last_switch;
> +	if (engine->stats.running)
> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> +				     last_switch);
> +
> +	/*
> +	 * Instead of adjusting the total for overflow, just add the
> +	 * difference from previous sample to the stats.total_gt_clks
> +	 */
> +	if (total && total != ~0U) {
> +		engine->stats.total_gt_clks += (u32)(total -
> +						     engine->stats.prev_total);
> +		engine->stats.prev_total = total;
> +	}
> +}
> +
> +static void guc_update_pm_timestamp(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u32 gt_stamp_now, gt_stamp_hi;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> +
> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> +		gt_stamp_hi++;
> +
> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
> +}
> +
> +/*
> + * Unlike the execlist mode of submission, total and active times are in
> + * terms of gt clocks. The *now parameter is retained to return the cpu
> + * time at which the busyness was sampled.
> + */
> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> +{
> +	struct intel_gt *gt = engine->gt;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	unsigned long flags;
> +	u64 total;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	*now = ktime_get();
> +
> +	/*
> +	 * The active busyness depends on start_gt_clk and gt_stamp.
> +	 * gt_stamp is updated by i915 only when gt is awake and the
> +	 * start_gt_clk is derived from GuC state. To get a consistent
> +	 * view of activity, we query the GuC state only if gt is awake.
> +	 */
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_engine_gt_clks(engine);
> +		guc_update_pm_timestamp(guc);
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> +	if (engine->stats.running) {
> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> +
> +		total += intel_gt_clock_interval_to_ns(gt, clk);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	return ns_to_ktime(total);
> +}
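For reference, the sampling equation reduces to a couple of lines once the 64-bit extension is done. Below is a hedged userspace model (helper names are illustrative; the ns conversion is a rough stand-in for intel_gt_clock_interval_to_ns() and ignores the overflow-safe division the kernel uses):

```c
#include <assert.h>
#include <stdint.h>

/* Busyness in gt clocks: total, plus the open interval if running. */
static uint64_t busyness_clks(uint64_t total, int running,
			      uint64_t gt_stamp, uint64_t start)
{
	return running ? total + (gt_stamp - start) : total;
}

/* Clocks -> ns at a given gt clock frequency in Hz (rough model). */
static uint64_t clks_to_ns(uint64_t clks, uint32_t freq_hz)
{
	return clks * 1000000000ull / freq_hz;
}
```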
> +
> +static void __update_guc_busyness_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_pm_timestamp(guc);
> +
> +		for_each_engine(engine, gt, id)
> +			guc_update_engine_gt_clks(engine);
> +
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +}
> +
> +static void guc_timestamp_ping(struct work_struct *wrk)
> +{
> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> +					     timestamp.work.work);
> +
> +	__update_guc_busyness_stats(guc);

 From the ping worker you may need to ensure you wake up the GPU (rather
than calling intel_gt_pm_get_if_awake in the update helper), or I think
there is a chance the ping gets unlucky and fails to do its job.

Probably get the pm ref here and remove it from 
__update_guc_busyness_stats, since the other caller (park) guarantees pm 
ref is still held.

> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> +{
> +	u32 offset = intel_guc_engine_usage_offset(guc);
> +	u32 action[] = {
> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> +		offset,
> +		0,
> +	};
> +
> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> +}
> +
> +static void guc_init_engine_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_wakeref_t wakeref;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);

Not sure how this slots in with unpark. It will probably be called twice,
but that probably does not matter? If you can figure it out perhaps you
can remove this call from here. Or maybe there is a separate path where
disable/enable can be called without the park/unpark transition, in which
case you could call the unpark helper here. Not sure really.

Regards,

Tvrtko

> +
> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> +		int ret = guc_action_enable_usage_stats(guc);
> +
> +		if (ret)
> +			drm_err(&gt->i915->drm,
> +				"Failed to enable usage stats: %d!\n", ret);
> +	}
> +}
> +
> +void intel_guc_busyness_park(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +	__update_guc_busyness_stats(guc);
> +}
> +
> +void intel_guc_busyness_unpark(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
>    */
>   int intel_guc_submission_init(struct intel_guc *guc)
>   {
> +	struct intel_gt *gt = guc_to_gt(guc);
>   	int ret;
>   
>   	if (guc->lrc_desc_pool)
> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->guc_id_list);
>   	ida_init(&guc->guc_ids);
>   
> +	spin_lock_init(&guc->timestamp.lock);
> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> +
>   	return 0;
>   }
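To see where the "27 seconds at 19.2 MHz" figure in the commit message comes from, here is a quick model of the ping_delay computation above (function name is illustrative; hz stands in for the kernel's HZ):

```c
#include <assert.h>
#include <stdint.h>

#define WRAP_TIME_CLKS UINT32_MAX
#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)

/*
 * 1/8th of the 32-bit wrap period, rounded up to whole seconds and
 * expressed in jiffies (hz = jiffies per second).
 */
static unsigned long ping_delay_jiffies(uint32_t gt_clock_frequency,
					unsigned long hz)
{
	return (POLL_TIME_CLKS / gt_clock_frequency + 1) * hz;
}
```

At 19.2 MHz the full 32-bit wrap takes roughly 224 seconds, so the worker fires about every 28 seconds, comfortably inside one wrap period.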
>   
> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_flush = gen12_emit_flush_xcs;
>   	}
>   	engine->set_default_submission = guc_set_default_submission;
> +	engine->busyness = guc_engine_busyness;
>   
> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>   
> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>   void intel_guc_submission_enable(struct intel_guc *guc)
>   {
>   	guc_init_lrc_mapping(guc);
> +	guc_init_engine_stats(guc);
>   }
>   
>   void intel_guc_submission_disable(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c7ef44fa0c36..5a95a9f0a8e3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>   				    struct i915_request *hung_rq,
>   				    struct drm_printer *m);
> +void intel_guc_busyness_park(struct intel_gt *gt);
> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>   
>   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>   
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index a897f4abea0c..9aee08425382 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>   
> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> +
>   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> 


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-06  9:11   ` Tvrtko Ursulin
  0 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-06  9:11 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa, intel-gfx, dri-devel
  Cc: john.c.harrison, daniel.vetter, Matthew Brost


On 05/10/2021 18:47, Umesh Nerlige Ramappa wrote:
> With GuC handling scheduling, i915 is not aware of the time that a
> context is scheduled in and out of the engine. Since i915 pmu relies on
> this info to provide engine busyness to the user, GuC shares this info
> with i915 for all engines using shared memory. For each engine, this
> info contains:
> 
> - total busyness: total time that the context was running (total)
> - id: id of the running context (id)
> - start timestamp: timestamp when the context started running (start)
> 
> At the time (now) of sampling the engine busyness, if the id is valid
> (!= ~0), and start is non-zero, then the context is considered to be
> active and the engine busyness is calculated using the below equation
> 
> 	engine busyness = total + (now - start)
> 
> All times are obtained from the gt clock base. For inactive contexts,
> engine busyness is just equal to the total.
> 
> The start and total values provided by GuC are 32 bits and wrap around
> in a few minutes. Since perf pmu provides busyness as 64 bit
> monotonically increasing values, there is a need for this implementation
> to account for overflows and extend the time to 64 bits before returning
> busyness to the user. In order to do that, a worker runs periodically at
> frequency = 1/8th the time it takes for the timestamp to wrap. As an
> example, that would be once in 27 seconds for a gt clock frequency of
> 19.2 MHz.
> 
> Opens and wip that are targeted for later patches:
> 
> 1) On global gt reset the total busyness of engines resets and i915
>     needs to fix that so that user sees monotonically increasing
>     busyness.
> 2) In runtime suspend mode, the worker may not need to be run. We could
>     stop the worker on suspend and rerun it on resume provided that the
>     guc pm timestamp does not tick during suspend.

Second point has now been addressed, right?

> 
> Note:
> There might be an overaccounting of busyness due to the fact that GuC
> may be updating the total and start values while kmd is reading them.
> (i.e kmd may read the updated total and the stale start). In such a
> case, user may see higher busyness value followed by smaller ones which
> would eventually catch up to the higher value.
> 
> v2: (Tvrtko)
> - Include details in commit message
> - Move intel engine busyness function into execlist code
> - Use union inside engine->stats
> - Use natural type for ping delay jiffies
> - Drop active_work condition checks
> - Use for_each_engine if iterating all engines
> - Drop seq locking, use spinlock at guc level to update engine stats
> - Document worker specific details
> 
> v3: (Tvrtko/Umesh)
> - Demarcate guc and execlist stat objects with comments
> - Document known over-accounting issue in commit
> - Provide a consistent view of guc state
> - Add hooks to gt park/unpark for guc busyness
> - Stop/start worker in gt park/unpark path
> - Drop inline
> - Move spinlock and worker inits to guc initialization
> - Drop helpers that are called only once
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>   drivers/gpu/drm/i915/i915_reg.h               |   2 +
>   12 files changed, 398 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..6fcc70a313d9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	intel_engine_print_breadcrumbs(engine, m);
>   }
>   
> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> -					    ktime_t *now)
> -{
> -	ktime_t total = engine->stats.total;
> -
> -	/*
> -	 * If the engine is executing something at the moment
> -	 * add it to the total.
> -	 */
> -	*now = ktime_get();
> -	if (READ_ONCE(engine->stats.active))
> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> -
> -	return total;
> -}
> -
>   /**
>    * intel_engine_get_busy_time() - Return current accumulated engine busyness
>    * @engine: engine to report on
> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>    */
>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>   {
> -	unsigned int seq;
> -	ktime_t total;
> -
> -	do {
> -		seq = read_seqcount_begin(&engine->stats.lock);
> -		total = __intel_engine_get_busy_time(engine, now);
> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> -
> -	return total;
> +	return engine->busyness(engine, now);
>   }
>   
>   struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..8e1b9c38a6fc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>   	void		(*add_active_request)(struct i915_request *rq);
>   	void		(*remove_active_request)(struct i915_request *rq);
>   
> +	/*
> +	 * Get engine busyness and the time at which the busyness was sampled.
> +	 */
> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> +				    ktime_t *now);
> +
>   	struct intel_engine_execlists execlists;
>   
>   	/*
> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>   	u32 (*get_cmd_length_mask)(u32 cmd_header);
>   
>   	struct {
> -		/**
> -		 * @active: Number of contexts currently scheduled in.
> -		 */
> -		unsigned int active;
> -
> -		/**
> -		 * @lock: Lock protecting the below fields.
> -		 */
> -		seqcount_t lock;
> -
> -		/**
> -		 * @total: Total time this engine was busy.
> -		 *
> -		 * Accumulated time not counting the most recent block in cases
> -		 * where engine is currently busy (active > 0).
> -		 */
> -		ktime_t total;
> -
> -		/**
> -		 * @start: Timestamp of the last idle to active transition.
> -		 *
> -		 * Idle is defined as active == 0, active is active > 0.
> -		 */
> -		ktime_t start;
> +		union {
> +			/* Fields used by the execlists backend. */
> +			struct {
> +				/**
> +				 * @active: Number of contexts currently
> +				 * scheduled in.
> +				 */
> +				unsigned int active;
> +
> +				/**
> +				 * @lock: Lock protecting the below fields.
> +				 */
> +				seqcount_t lock;
> +
> +				/**
> +				 * @total: Total time this engine was busy.
> +				 *
> +				 * Accumulated time not counting the most recent
> +				 * block in cases where engine is currently busy
> +				 * (active > 0).
> +				 */
> +				ktime_t total;
> +
> +				/**
> +				 * @start: Timestamp of the last idle to active
> +				 * transition.
> +				 *
> +				 * Idle is defined as active == 0, active is
> +				 * active > 0.
> +				 */
> +				ktime_t start;
> +			};
> +
> +			/* Fields used by the GuC backend. */
> +			struct {
> +				/**
> +				 * @running: Active state of the engine when
> +				 * busyness was last sampled.
> +				 */
> +				bool running;
> +
> +				/**
> +				 * @prev_total: Previous value of total runtime
> +				 * clock cycles.
> +				 */
> +				u32 prev_total;
> +
> +				/**
> +				 * @total_gt_clks: Total gt clock cycles this
> +				 * engine was busy.
> +				 */
> +				u64 total_gt_clks;
> +
> +				/**
> +				 * @start_gt_clk: GT clock time of last idle to
> +				 * active transition.
> +				 */
> +				u64 start_gt_clk;
> +			};
> +		};
>   
>   		/**
>   		 * @rps: Utilisation at last RPS sampling.
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5c9b695e906c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>   	lrc_fini_wa_ctx(engine);
>   }
>   
> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> +					   ktime_t *now)
> +{
> +	ktime_t total = engine->stats.total;
> +
> +	/*
> +	 * If the engine is executing something at the moment
> +	 * add it to the total.
> +	 */
> +	*now = ktime_get();
> +	if (READ_ONCE(engine->stats.active))
> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> +
> +	return total;
> +}
> +
> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> +					 ktime_t *now)
> +{
> +	unsigned int seq;
> +	ktime_t total;
> +
> +	do {
> +		seq = read_seqcount_begin(&engine->stats.lock);
> +		total = __execlists_engine_busyness(engine, now);
> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> +
> +	return total;
> +}
> +
>   static void
>   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   {
> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_bb_start = gen8_emit_bb_start;
>   	else
>   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> +
> +	engine->busyness = execlists_engine_busyness;
>   }
>   
>   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index 524eaf678790..b4a8594bc46c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>   	intel_rc6_unpark(&gt->rc6);
>   	intel_rps_unpark(&gt->rps);
>   	i915_pmu_gt_unparked(i915);
> +	intel_guc_busyness_unpark(gt);
>   
>   	intel_gt_unpark_requests(gt);
>   	runtime_begin(gt);
> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>   	runtime_end(gt);
>   	intel_gt_park_requests(gt);
>   
> +	intel_guc_busyness_park(gt);
>   	i915_vma_parked(gt);
>   	i915_pmu_gt_parked(i915);
>   	intel_rps_park(&gt->rps);
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ff1311d4beff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -143,6 +143,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>   	INTEL_GUC_ACTION_LIMIT
>   };
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..22c30dbdf63a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -104,6 +104,8 @@ struct intel_guc {
>   	u32 ads_regset_size;
>   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>   	u32 ads_golden_ctxt_size;
> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> +	u32 ads_engine_usage_size;
>   
>   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>   	struct i915_vma *lrc_desc_pool;
> @@ -138,6 +140,30 @@ struct intel_guc {
>   
>   	/** @send_mutex: used to serialize the intel_guc_send actions */
>   	struct mutex send_mutex;
> +
> +	struct {
> +		/**
> +		 * @lock: Lock protecting the below fields and the engine stats.
> +		 */
> +		spinlock_t lock;
> +
> +		/**
> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> +		 */
> +		u64 gt_stamp;
> +
> +		/**
> +		 * @ping_delay: Period for polling the GT timestamp for
> +		 * overflow.
> +		 */
> +		unsigned long ping_delay;
> +
> +		/**
> +		 * @work: Periodic work to adjust GT timestamp, engine and
> +		 * context usage for overflows.
> +		 */
> +		struct delayed_work work;
> +	} timestamp;
>   };
>   
>   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..ca9ab53999d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -26,6 +26,8 @@
>    *      | guc_policies                          |
>    *      +---------------------------------------+
>    *      | guc_gt_system_info                    |
> + *      +---------------------------------------+
> + *      | guc_engine_usage                      |
>    *      +---------------------------------------+ <== static
>    *      | guc_mmio_reg[countA] (engine 0.0)     |
>    *      | guc_mmio_reg[countB] (engine 0.1)     |
> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>   	struct guc_ads ads;
>   	struct guc_policies policies;
>   	struct guc_gt_system_info system_info;
> +	struct guc_engine_usage engine_usage;
>   	/* From here on, location is dynamic! Refer to above diagram. */
>   	struct guc_mmio_reg regset[0];
>   } __packed;
> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>   
>   	guc_ads_private_data_reset(guc);
>   }
> +
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> +{
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> +	u32 offset = base + ptr_offset(blob, engine_usage);
> +
> +	return offset;
> +}
> +
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> +{
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u8 guc_class = engine_class_to_guc_class(engine->class);
> +
> +	return &blob->engine_usage.engines[guc_class][engine->instance];
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> index 3d85051d57e4..e74c110facff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> @@ -6,8 +6,11 @@
>   #ifndef _INTEL_GUC_ADS_H_
>   #define _INTEL_GUC_ADS_H_
>   
> +#include <linux/types.h>
> +
>   struct intel_guc;
>   struct drm_printer;
> +struct intel_engine_cs;
>   
>   int intel_guc_ads_create(struct intel_guc *guc);
>   void intel_guc_ads_destroy(struct intel_guc *guc);
> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>   void intel_guc_ads_reset(struct intel_guc *guc);
>   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>   				     struct drm_printer *p);
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>   
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..7c9c081670fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -294,6 +294,19 @@ struct guc_ads {
>   	u32 reserved[15];
>   } __packed;
>   
> +/* Engine usage stats */
> +struct guc_engine_usage_record {
> +	u32 current_context_index;
> +	u32 last_switch_in_stamp;
> +	u32 reserved0;
> +	u32 total_runtime;
> +	u32 reserved1[4];
> +} __packed;
> +
> +struct guc_engine_usage {
> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> +} __packed;
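Since GuC writes these records directly into shared memory, the layout must stay fixed at eight u32s (32 bytes) per engine. A standalone check of that assumption (mirrored with standard types; __packed in the kernel corresponds to the packed attribute here):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of guc_engine_usage_record using standard fixed-width types. */
struct usage_record {
	uint32_t current_context_index;
	uint32_t last_switch_in_stamp;
	uint32_t reserved0;
	uint32_t total_runtime;
	uint32_t reserved1[4];
} __attribute__((packed));
```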
> +
>   /* GuC logging structures */
>   
>   enum guc_log_buffer_type {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..3f7d0f2ac9da 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -12,6 +12,7 @@
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
>   #include "gt/intel_gt.h"
> +#include "gt/intel_gt_clock_utils.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
>   #include "gt/intel_gt_requests.h"
> @@ -20,6 +21,7 @@
>   #include "gt/intel_mocs.h"
>   #include "gt/intel_ring.h"
>   
> +#include "intel_guc_ads.h"
>   #include "intel_guc_submission.h"
>   
>   #include "i915_drv.h"
> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>   static void disable_submission(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
>   
>   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>   		GEM_BUG_ON(!guc->ct.enabled);
>   		__tasklet_disable_sync_once(&sched_engine->tasklet);
>   		sched_engine->tasklet.callback = NULL;
>   	}
> +
> +	cancel_delayed_work(&guc->timestamp.work);

I am not sure when disable_submission gets called, so a question: could
it be important to call cancel_delayed_work_sync here, to ensure that if
the worker was running it has exited before proceeding?

Also, does this interact with the open about resets? Should/could the
parking helper be called from here?

> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	for_each_engine(engine, gt, id)
> +		engine->stats.prev_total = 0;
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>   }
>   
>   static void enable_submission(struct intel_guc *guc)
> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>   }
>   
> +/*
> + * GuC stores busyness stats for each engine at context in/out boundaries. A
> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> + * GuC.
> + *
> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> + * active. For an active engine, total busyness = total + (now - start),
> + * where 'now' is the time at which the busyness is sampled. For an
> + * inactive engine, total busyness = total.
> + *
> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> + *
> + * The start and total values provided by GuC are 32 bits and wrap around in a
> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> + * increasing ns values, there is a need for this implementation to account for
> + * overflows and extend the GuC provided values to 64 bits before returning
> + * busyness to the user. In order to do that, a worker runs periodically at
> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> + * 27 seconds for a gt clock frequency of 19.2 MHz).
> + */
> +
> +#define WRAP_TIME_CLKS U32_MAX
> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> +
> +static void
> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> +{
> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> +
> +	if (new_start == lower_32_bits(*prev_start))
> +		return;
> +
> +	if (new_start < gt_stamp_last &&
> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> +		gt_stamp_hi++;
> +
> +	if (new_start > gt_stamp_last &&
> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> +		gt_stamp_hi--;
> +
> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> +}
> +
> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> +{
> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	u32 last_switch = rec->last_switch_in_stamp;
> +	u32 ctx_id = rec->current_context_index;
> +	u32 total = rec->total_runtime;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	engine->stats.running = ctx_id != ~0U && last_switch;
> +	if (engine->stats.running)
> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> +				     last_switch);
> +
> +	/*
> +	 * Instead of adjusting the total for overflow, just add the
> +	 * difference from previous sample to the stats.total_gt_clks
> +	 */
> +	if (total && total != ~0U) {
> +		engine->stats.total_gt_clks += (u32)(total -
> +						     engine->stats.prev_total);
> +		engine->stats.prev_total = total;
> +	}
> +}
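The delta-accumulation trick in the function above can be modelled in isolation: u32 subtraction makes a wrapped running total "just work", provided at most one wrap occurs between samples, which the ping worker guarantees. An illustrative sketch (not driver code; names mirror the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold a possibly-wrapped 32-bit running total into a 64-bit
 * accumulator by adding only the delta since the previous sample.
 */
static void accumulate_total(uint64_t *total_gt_clks, uint32_t *prev_total,
			     uint32_t total)
{
	if (total && total != UINT32_MAX) {
		*total_gt_clks += (uint32_t)(total - *prev_total);
		*prev_total = total;
	}
}
```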
> +
> +static void guc_update_pm_timestamp(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u32 gt_stamp_now, gt_stamp_hi;
> +
> +	lockdep_assert_held(&guc->timestamp.lock);
> +
> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> +
> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> +		gt_stamp_hi++;
> +
> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
> +}
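Similarly, guc_update_pm_timestamp() can be modelled standalone: because the worker runs well inside one wrap period, the low 32 bits going backwards can only mean the register wrapped exactly once since the last sample (illustrative sketch, not driver code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extend a freshly read 32-bit GUCPMTIMESTAMP value against the
 * previous 64-bit extended stamp.
 */
static uint64_t update_pm_timestamp(uint64_t gt_stamp, uint32_t reg_now)
{
	uint32_t hi = (uint32_t)(gt_stamp >> 32);

	if (reg_now < (uint32_t)gt_stamp)
		hi++;

	return ((uint64_t)hi << 32) | reg_now;
}
```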
> +
> +/*
> + * Unlike the execlist mode of submission, total and active times are in
> + * terms of gt clocks. The *now parameter is retained to return the cpu
> + * time at which the busyness was sampled.
> + */
> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> +{
> +	struct intel_gt *gt = engine->gt;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	unsigned long flags;
> +	u64 total;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	*now = ktime_get();
> +
> +	/*
> +	 * The active busyness depends on start_gt_clk and gt_stamp.
> +	 * gt_stamp is updated by i915 only when gt is awake and the
> +	 * start_gt_clk is derived from GuC state. To get a consistent
> +	 * view of activity, we query the GuC state only if gt is awake.
> +	 */
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_engine_gt_clks(engine);
> +		guc_update_pm_timestamp(guc);
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> +	if (engine->stats.running) {
> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> +
> +		total += intel_gt_clock_interval_to_ns(gt, clk);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	return ns_to_ktime(total);
> +}
> +
> +static void __update_guc_busyness_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		guc_update_pm_timestamp(guc);
> +
> +		for_each_engine(engine, gt, id)
> +			guc_update_engine_gt_clks(engine);
> +
> +		intel_gt_pm_put_async(gt);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +}
> +
> +static void guc_timestamp_ping(struct work_struct *wrk)
> +{
> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> +					     timestamp.work.work);
> +
> +	__update_guc_busyness_stats(guc);

From ping you may need to ensure the GPU is actually woken up (rather than calling intel_gt_pm_get_if_awake in the update helper), or I think there is a chance the ping gets unlucky, finds the GT parked, and fails to do its job.

Probably get the pm ref here and remove it from __update_guc_busyness_stats, since the other caller (park) guarantees the pm ref is still held.
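A rough sketch of that suggestion, in kernel-style pseudocode (not compiled; it reuses the names from the patch and assumes __update_guc_busyness_stats() is then allowed to rely on the device being awake):

```c
static void guc_timestamp_ping(struct work_struct *wrk)
{
	struct intel_guc *guc = container_of(wrk, typeof(*guc),
					     timestamp.work.work);
	struct intel_gt *gt = guc_to_gt(guc);
	intel_wakeref_t wakeref;

	/* Take an explicit pm ref so the periodic sample cannot be skipped. */
	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
		__update_guc_busyness_stats(guc);

	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
			 guc->timestamp.ping_delay);
}
```

with the intel_gt_pm_get_if_awake() check dropped from __update_guc_busyness_stats(), since both remaining callers (park and the ping) would then guarantee the GT is awake.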

> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> +{
> +	u32 offset = intel_guc_engine_usage_offset(guc);
> +	u32 action[] = {
> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> +		offset,
> +		0,
> +	};
> +
> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> +}
> +
> +static void guc_init_engine_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_wakeref_t wakeref;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);

Not sure how this slots in with unpark. It will probably be called twice, but that probably does not matter? If you can figure it out, perhaps you can remove this call from here. Or maybe there is a separate path where disable/enable can be called without a park/unpark transition, in which case you could call the unpark helper here. Not sure really.

Regards,

Tvrtko

> +
> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> +		int ret = guc_action_enable_usage_stats(guc);
> +
> +		if (ret)
> +			drm_err(&gt->i915->drm,
> +				"Failed to enable usage stats: %d!\n", ret);
> +	}
> +}
> +
> +void intel_guc_busyness_park(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +	__update_guc_busyness_stats(guc);
> +}
> +
> +void intel_guc_busyness_unpark(struct intel_gt *gt)
> +{
> +	struct intel_guc *guc = &gt->uc.guc;
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> +			 guc->timestamp.ping_delay);
> +}
> +
>   /*
>    * Set up the memory resources to be shared with the GuC (via the GGTT)
>    * at firmware loading time.
>    */
>   int intel_guc_submission_init(struct intel_guc *guc)
>   {
> +	struct intel_gt *gt = guc_to_gt(guc);
>   	int ret;
>   
>   	if (guc->lrc_desc_pool)
> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>   	INIT_LIST_HEAD(&guc->guc_id_list);
>   	ida_init(&guc->guc_ids);
>   
> +	spin_lock_init(&guc->timestamp.lock);
> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> +
>   	return 0;
>   }
>   
> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_flush = gen12_emit_flush_xcs;
>   	}
>   	engine->set_default_submission = guc_set_default_submission;
> +	engine->busyness = guc_engine_busyness;
>   
> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>   
> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>   void intel_guc_submission_enable(struct intel_guc *guc)
>   {
>   	guc_init_lrc_mapping(guc);
> +	guc_init_engine_stats(guc);
>   }
>   
>   void intel_guc_submission_disable(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> index c7ef44fa0c36..5a95a9f0a8e3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>   				    struct i915_request *hung_rq,
>   				    struct drm_printer *m);
> +void intel_guc_busyness_park(struct intel_gt *gt);
> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>   
>   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>   
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index a897f4abea0c..9aee08425382 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>   
> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> +
>   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> 


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-06  8:22     ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-10-06 17:04       ` Matthew Brost
  -1 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2021-10-06 17:04 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Umesh Nerlige Ramappa, intel-gfx, dri-devel, john.c.harrison,
	daniel.vetter

On Wed, Oct 06, 2021 at 09:22:42AM +0100, Tvrtko Ursulin wrote:
> 
> On 06/10/2021 00:14, Matthew Brost wrote:
> > On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
> > > With GuC handling scheduling, i915 is not aware of the time that a
> > > context is scheduled in and out of the engine. Since i915 pmu relies on
> > > this info to provide engine busyness to the user, GuC shares this info
> > > with i915 for all engines using shared memory. For each engine, this
> > > info contains:
> > > 
> > > - total busyness: total time that the context was running (total)
> > > - id: id of the running context (id)
> > > - start timestamp: timestamp when the context started running (start)
> > > 
> > > At the time (now) of sampling the engine busyness, if the id is valid
> > > (!= ~0), and start is non-zero, then the context is considered to be
> > > active and the engine busyness is calculated using the below equation
> > > 
> > > 	engine busyness = total + (now - start)
> > > 
> > > All times are obtained from the gt clock base. For inactive contexts,
> > > engine busyness is just equal to the total.
> > > 
> > > The start and total values provided by GuC are 32 bits and wrap around
> > > in a few minutes. Since perf pmu provides busyness as 64 bit
> > > monotonically increasing values, there is a need for this implementation
> > > to account for overflows and extend the time to 64 bits before returning
> > > busyness to the user. In order to do that, a worker runs periodically at
> > > frequency = 1/8th the time it takes for the timestamp to wrap. As an
> > > example, that would be once in 27 seconds for a gt clock frequency of
> > > 19.2 MHz.
> > > 
> > > Opens and wip that are targeted for later patches:
> > > 
> > > 1) On global gt reset the total busyness of engines resets and i915
> > >     needs to fix that so that user sees monotonically increasing
> > >     busyness.
> > > 2) In runtime suspend mode, the worker may not need to be run. We could
> > >     stop the worker on suspend and rerun it on resume provided that the
> > >     guc pm timestamp does not tick during suspend.
> > > 
> > > Note:
> > > There might be an overaccounting of busyness due to the fact that GuC
> > > may be updating the total and start values while kmd is reading them.
> > > (i.e kmd may read the updated total and the stale start). In such a
> > > case, user may see higher busyness value followed by smaller ones which
> > > would eventually catch up to the higher value.
> > > 
> > > v2: (Tvrtko)
> > > - Include details in commit message
> > > - Move intel engine busyness function into execlist code
> > > - Use union inside engine->stats
> > > - Use natural type for ping delay jiffies
> > > - Drop active_work condition checks
> > > - Use for_each_engine if iterating all engines
> > > - Drop seq locking, use spinlock at guc level to update engine stats
> > > - Document worker specific details
> > > 
> > > v3: (Tvrtko/Umesh)
> > > - Demarcate guc and execlist stat objects with comments
> > > - Document known over-accounting issue in commit
> > > - Provide a consistent view of guc state
> > > - Add hooks to gt park/unpark for guc busyness
> > > - Stop/start worker in gt park/unpark path
> > > - Drop inline
> > > - Move spinlock and worker inits to guc initialization
> > > - Drop helpers that are called only once
> > > 
> > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
> > >   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
> > >   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
> > >   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
> > >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
> > >   drivers/gpu/drm/i915/i915_reg.h               |   2 +
> > >   12 files changed, 398 insertions(+), 49 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > index 2ae57e4656a3..6fcc70a313d9 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> > >   	intel_engine_print_breadcrumbs(engine, m);
> > >   }
> > > -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> > > -					    ktime_t *now)
> > > -{
> > > -	ktime_t total = engine->stats.total;
> > > -
> > > -	/*
> > > -	 * If the engine is executing something at the moment
> > > -	 * add it to the total.
> > > -	 */
> > > -	*now = ktime_get();
> > > -	if (READ_ONCE(engine->stats.active))
> > > -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> > > -
> > > -	return total;
> > > -}
> > > -
> > >   /**
> > >    * intel_engine_get_busy_time() - Return current accumulated engine busyness
> > >    * @engine: engine to report on
> > > @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> > >    */
> > >   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
> > >   {
> > > -	unsigned int seq;
> > > -	ktime_t total;
> > > -
> > > -	do {
> > > -		seq = read_seqcount_begin(&engine->stats.lock);
> > > -		total = __intel_engine_get_busy_time(engine, now);
> > > -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> > > -
> > > -	return total;
> > > +	return engine->busyness(engine, now);
> > >   }
> > >   struct intel_context *
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > index 5ae1207c363b..8e1b9c38a6fc 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > @@ -432,6 +432,12 @@ struct intel_engine_cs {
> > >   	void		(*add_active_request)(struct i915_request *rq);
> > >   	void		(*remove_active_request)(struct i915_request *rq);
> > > +	/*
> > > +	 * Get engine busyness and the time at which the busyness was sampled.
> > > +	 */
> > > +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> > > +				    ktime_t *now);
> > > +
> > >   	struct intel_engine_execlists execlists;
> > >   	/*
> > > @@ -481,30 +487,66 @@ struct intel_engine_cs {
> > >   	u32 (*get_cmd_length_mask)(u32 cmd_header);
> > >   	struct {
> > > -		/**
> > > -		 * @active: Number of contexts currently scheduled in.
> > > -		 */
> > > -		unsigned int active;
> > > -
> > > -		/**
> > > -		 * @lock: Lock protecting the below fields.
> > > -		 */
> > > -		seqcount_t lock;
> > > -
> > > -		/**
> > > -		 * @total: Total time this engine was busy.
> > > -		 *
> > > -		 * Accumulated time not counting the most recent block in cases
> > > -		 * where engine is currently busy (active > 0).
> > > -		 */
> > > -		ktime_t total;
> > > -
> > > -		/**
> > > -		 * @start: Timestamp of the last idle to active transition.
> > > -		 *
> > > -		 * Idle is defined as active == 0, active is active > 0.
> > > -		 */
> > > -		ktime_t start;
> > > +		union {
> > > +			/* Fields used by the execlists backend. */
> > > +			struct {
> > > +				/**
> > > +				 * @active: Number of contexts currently
> > > +				 * scheduled in.
> > > +				 */
> > > +				unsigned int active;
> > > +
> > > +				/**
> > > +				 * @lock: Lock protecting the below fields.
> > > +				 */
> > > +				seqcount_t lock;
> > > +
> > > +				/**
> > > +				 * @total: Total time this engine was busy.
> > > +				 *
> > > +				 * Accumulated time not counting the most recent
> > > +				 * block in cases where engine is currently busy
> > > +				 * (active > 0).
> > > +				 */
> > > +				ktime_t total;
> > > +
> > > +				/**
> > > +				 * @start: Timestamp of the last idle to active
> > > +				 * transition.
> > > +				 *
> > > +				 * Idle is defined as active == 0, active is
> > > +				 * active > 0.
> > > +				 */
> > > +				ktime_t start;
> > > +			};
> > 
> > Not anonymous? e.g.
> > 
> > struct {
> > 	...
> > } execlists;
> > struct {
> > 	...
> > } guc;
> > 
> > IMO this is better as it is self-documenting, and if you touch a
> > backend-specific field in a non-backend-specific file it pops out as
> > incorrect.
> > 
> > > +
> > > +			/* Fields used by the GuC backend. */
> > > +			struct {
> > > +				/**
> > > +				 * @running: Active state of the engine when
> > > +				 * busyness was last sampled.
> > > +				 */
> > > +				bool running;
> > > +
> > > +				/**
> > > +				 * @prev_total: Previous value of total runtime
> > > +				 * clock cycles.
> > > +				 */
> > > +				u32 prev_total;
> > > +
> > > +				/**
> > > +				 * @total_gt_clks: Total gt clock cycles this
> > > +				 * engine was busy.
> > > +				 */
> > > +				u64 total_gt_clks;
> > > +
> > > +				/**
> > > +				 * @start_gt_clk: GT clock time of last idle to
> > > +				 * active transition.
> > > +				 */
> > > +				u64 start_gt_clk;
> > > +			};
> > > +		};
> > >   		/**
> > >   		 * @rps: Utilisation at last RPS sampling.
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index 7147fe80919e..5c9b695e906c 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
> > >   	lrc_fini_wa_ctx(engine);
> > >   }
> > > +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> > > +					   ktime_t *now)
> > > +{
> > > +	ktime_t total = engine->stats.total;
> > > +
> > > +	/*
> > > +	 * If the engine is executing something at the moment
> > > +	 * add it to the total.
> > > +	 */
> > > +	*now = ktime_get();
> > > +	if (READ_ONCE(engine->stats.active))
> > > +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> > > +
> > > +	return total;
> > > +}
> > > +
> > > +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> > > +					 ktime_t *now)
> > > +{
> > > +	unsigned int seq;
> > > +	ktime_t total;
> > > +
> > > +	do {
> > > +		seq = read_seqcount_begin(&engine->stats.lock);
> > > +		total = __execlists_engine_busyness(engine, now);
> > > +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> > > +
> > > +	return total;
> > > +}
> > > +
> > >   static void
> > >   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
> > >   {
> > > @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
> > >   		engine->emit_bb_start = gen8_emit_bb_start;
> > >   	else
> > >   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> > > +
> > > +	engine->busyness = execlists_engine_busyness;
> > >   }
> > >   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > index 524eaf678790..b4a8594bc46c 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
> > >   	intel_rc6_unpark(&gt->rc6);
> > >   	intel_rps_unpark(&gt->rps);
> > >   	i915_pmu_gt_unparked(i915);
> > > +	intel_guc_busyness_unpark(gt);
> > 
> > I personally don't mind this, but in the spirit of correct layering this
> > likely should be a generic wrapper inline function which calls a vfunc if
> > present (e.g. set the vfunc for the GuC backend, don't set it for execlists).
> > 
> > >   	intel_gt_unpark_requests(gt);
> > >   	runtime_begin(gt);
> > > @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
> > >   	runtime_end(gt);
> > >   	intel_gt_park_requests(gt);
> > > +	intel_guc_busyness_park(gt);
> > 
> > Same here.
> > 
> > >   	i915_vma_parked(gt);
> > >   	i915_pmu_gt_parked(i915);
> > >   	intel_rps_park(&gt->rps);
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > index 8ff582222aff..ff1311d4beff 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > @@ -143,6 +143,7 @@ enum intel_guc_action {
> > >   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> > >   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > >   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> > > +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
> > >   	INTEL_GUC_ACTION_LIMIT
> > >   };
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 5dd174babf7a..22c30dbdf63a 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -104,6 +104,8 @@ struct intel_guc {
> > >   	u32 ads_regset_size;
> > >   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
> > >   	u32 ads_golden_ctxt_size;
> > > +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> > > +	u32 ads_engine_usage_size;
> > >   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
> > >   	struct i915_vma *lrc_desc_pool;
> > > @@ -138,6 +140,30 @@ struct intel_guc {
> > >   	/** @send_mutex: used to serialize the intel_guc_send actions */
> > >   	struct mutex send_mutex;
> > > +
> > > +	struct {
> > > +		/**
> > > +		 * @lock: Lock protecting the below fields and the engine stats.
> > > +		 */
> > > +		spinlock_t lock;
> > > +
> > 
> > Again, I really don't mind, but I'm told not to add more spin locks than
> > needed. This really should be protected by a generic GuC submission spin
> > lock, e.g. build on this patch and protect all of this with the
> > submission_state.lock.
> 
> I see no good reason to use the submission lock here. The two are completely
> different paths, with completely different entry points and we don't want to
> introduce contention where it is trivially avoidable for no real cost. In
> other words I think this lock is well defined and localised both in code and
> in execution flows.
> 

The direction from architecture is to use as few locks as possible;
not saying I agree, just passing along the top-down direction.

Matt

> Regards,
> 
> Tvrtko
> 
> > 
> > https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5
> > 
> > Whomevers series gets merged first can include the above patch.
> > 
> > Rest the series looks fine cosmetically to me.
> > 
> > Matt
> > 
> > > +		/**
> > > +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> > > +		 */
> > > +		u64 gt_stamp;
> > > +
> > > +		/**
> > > +		 * @ping_delay: Period for polling the GT timestamp for
> > > +		 * overflow.
> > > +		 */
> > > +		unsigned long ping_delay;
> > > +
> > > +		/**
> > > +		 * @work: Periodic work to adjust GT timestamp, engine and
> > > +		 * context usage for overflows.
> > > +		 */
> > > +		struct delayed_work work;
> > > +	} timestamp;
> > >   };
> > >   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > index 2c6ea64af7ec..ca9ab53999d5 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > @@ -26,6 +26,8 @@
> > >    *      | guc_policies                          |
> > >    *      +---------------------------------------+
> > >    *      | guc_gt_system_info                    |
> > > + *      +---------------------------------------+
> > > + *      | guc_engine_usage                      |
> > >    *      +---------------------------------------+ <== static
> > >    *      | guc_mmio_reg[countA] (engine 0.0)     |
> > >    *      | guc_mmio_reg[countB] (engine 0.1)     |
> > > @@ -47,6 +49,7 @@ struct __guc_ads_blob {
> > >   	struct guc_ads ads;
> > >   	struct guc_policies policies;
> > >   	struct guc_gt_system_info system_info;
> > > +	struct guc_engine_usage engine_usage;
> > >   	/* From here on, location is dynamic! Refer to above diagram. */
> > >   	struct guc_mmio_reg regset[0];
> > >   } __packed;
> > > @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
> > >   	guc_ads_private_data_reset(guc);
> > >   }
> > > +
> > > +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> > > +{
> > > +	struct __guc_ads_blob *blob = guc->ads_blob;
> > > +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> > > +	u32 offset = base + ptr_offset(blob, engine_usage);
> > > +
> > > +	return offset;
> > > +}
> > > +
> > > +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> > > +{
> > > +	struct intel_guc *guc = &engine->gt->uc.guc;
> > > +	struct __guc_ads_blob *blob = guc->ads_blob;
> > > +	u8 guc_class = engine_class_to_guc_class(engine->class);
> > > +
> > > +	return &blob->engine_usage.engines[guc_class][engine->instance];
> > > +}
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > index 3d85051d57e4..e74c110facff 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > @@ -6,8 +6,11 @@
> > >   #ifndef _INTEL_GUC_ADS_H_
> > >   #define _INTEL_GUC_ADS_H_
> > > +#include <linux/types.h>
> > > +
> > >   struct intel_guc;
> > >   struct drm_printer;
> > > +struct intel_engine_cs;
> > >   int intel_guc_ads_create(struct intel_guc *guc);
> > >   void intel_guc_ads_destroy(struct intel_guc *guc);
> > > @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
> > >   void intel_guc_ads_reset(struct intel_guc *guc);
> > >   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
> > >   				     struct drm_printer *p);
> > > +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> > > +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
> > >   #endif
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > index fa4be13c8854..7c9c081670fc 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > @@ -294,6 +294,19 @@ struct guc_ads {
> > >   	u32 reserved[15];
> > >   } __packed;
> > > +/* Engine usage stats */
> > > +struct guc_engine_usage_record {
> > > +	u32 current_context_index;
> > > +	u32 last_switch_in_stamp;
> > > +	u32 reserved0;
> > > +	u32 total_runtime;
> > > +	u32 reserved1[4];
> > > +} __packed;
> > > +
> > > +struct guc_engine_usage {
> > > +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> > > +} __packed;
> > > +
> > >   /* GuC logging structures */
> > >   enum guc_log_buffer_type {
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index ba0de35f6323..3f7d0f2ac9da 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -12,6 +12,7 @@
> > >   #include "gt/intel_engine_pm.h"
> > >   #include "gt/intel_engine_heartbeat.h"
> > >   #include "gt/intel_gt.h"
> > > +#include "gt/intel_gt_clock_utils.h"
> > >   #include "gt/intel_gt_irq.h"
> > >   #include "gt/intel_gt_pm.h"
> > >   #include "gt/intel_gt_requests.h"
> > > @@ -20,6 +21,7 @@
> > >   #include "gt/intel_mocs.h"
> > >   #include "gt/intel_ring.h"
> > > +#include "intel_guc_ads.h"
> > >   #include "intel_guc_submission.h"
> > >   #include "i915_drv.h"
> > > @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
> > >   static void disable_submission(struct intel_guc *guc)
> > >   {
> > >   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	struct intel_engine_cs *engine;
> > > +	enum intel_engine_id id;
> > > +	unsigned long flags;
> > >   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > >   		GEM_BUG_ON(!guc->ct.enabled);
> > >   		__tasklet_disable_sync_once(&sched_engine->tasklet);
> > >   		sched_engine->tasklet.callback = NULL;
> > >   	}
> > > +
> > > +	cancel_delayed_work(&guc->timestamp.work);
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	for_each_engine(engine, gt, id)
> > > +		engine->stats.prev_total = 0;
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > >   }
> > >   static void enable_submission(struct intel_guc *guc)
> > > @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> > >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> > >   }
> > > +/*
> > > + * GuC stores busyness stats for each engine at context in/out boundaries. A
> > > + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> > > + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> > > + * GuC.
> > > + *
> > > + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> > > + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> > > + * active. For an active engine total busyness = total + (now - start), where
> > > + * 'now' is the time at which the busyness is sampled. For inactive engine,
> > > + * total busyness = total.
> > > + *
> > > + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> > > + *
> > > + * The start and total values provided by GuC are 32 bits and wrap around in a
> > > + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> > > + * increasing ns values, there is a need for this implementation to account for
> > > + * overflows and extend the GuC provided values to 64 bits before returning
> > > + * busyness to the user. In order to do that, a worker runs periodically at
> > > + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> > > + * 27 seconds for a gt clock frequency of 19.2 MHz).
> > > + */
> > > +
> > > +#define WRAP_TIME_CLKS U32_MAX
> > > +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> > > +
> > > +static void
> > > +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> > > +{
> > > +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> > > +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> > > +
> > > +	if (new_start == lower_32_bits(*prev_start))
> > > +		return;
> > > +
> > > +	if (new_start < gt_stamp_last &&
> > > +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> > > +		gt_stamp_hi++;
> > > +
> > > +	if (new_start > gt_stamp_last &&
> > > +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> > > +		gt_stamp_hi--;
> > > +
> > > +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> > > +}
> > > +
> > > +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> > > +{
> > > +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> > > +	struct intel_guc *guc = &engine->gt->uc.guc;
> > > +	u32 last_switch = rec->last_switch_in_stamp;
> > > +	u32 ctx_id = rec->current_context_index;
> > > +	u32 total = rec->total_runtime;
> > > +
> > > +	lockdep_assert_held(&guc->timestamp.lock);
> > > +
> > > +	engine->stats.running = ctx_id != ~0U && last_switch;
> > > +	if (engine->stats.running)
> > > +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> > > +				     last_switch);
> > > +
> > > +	/*
> > > +	 * Instead of adjusting the total for overflow, just add the
> > > +	 * difference from previous sample to the stats.total_gt_clks
> > > +	 */
> > > +	if (total && total != ~0U) {
> > > +		engine->stats.total_gt_clks += (u32)(total -
> > > +						     engine->stats.prev_total);
> > > +		engine->stats.prev_total = total;
> > > +	}
> > > +}
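The delta accounting in the quoted hunk above can be demonstrated standalone; below is a userspace sketch (names invented) of folding a wrapping u32 counter into a monotonic u64 total by adding the u32 difference from the previous sample:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the "add the difference from previous sample" technique in
 * guc_update_engine_gt_clks(): total accumulates in 64 bits, prev holds
 * the last raw 32-bit sample.
 */
uint64_t accumulate_total(uint64_t total, uint32_t *prev, uint32_t sample)
{
	/*
	 * The u32 subtraction is wrap-safe: it yields the elapsed count
	 * even when sample has wrapped past *prev, so the total never
	 * needs an explicit overflow adjustment.
	 */
	total += (uint32_t)(sample - *prev);
	*prev = sample;
	return total;
}
```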
> > > +
> > > +static void guc_update_pm_timestamp(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	u32 gt_stamp_now, gt_stamp_hi;
> > > +
> > > +	lockdep_assert_held(&guc->timestamp.lock);
> > > +
> > > +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> > > +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> > > +
> > > +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> > > +		gt_stamp_hi++;
> > > +
> > > +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
> > > +}
> > > +
> > > +/*
> > > + * Unlike the execlist mode of submission, total and active times are in
> > > + * terms of gt clocks. The *now parameter is retained to return the cpu
> > > + * time at which the busyness was sampled.
> > > + */
> > > +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> > > +{
> > > +	struct intel_gt *gt = engine->gt;
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +	unsigned long flags;
> > > +	u64 total;
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	*now = ktime_get();
> > > +
> > > +	/*
> > > +	 * The active busyness depends on start_gt_clk and gt_stamp.
> > > +	 * gt_stamp is updated by i915 only when gt is awake and the
> > > +	 * start_gt_clk is derived from GuC state. To get a consistent
> > > +	 * view of activity, we query the GuC state only if gt is awake.
> > > +	 */
> > > +	if (intel_gt_pm_get_if_awake(gt)) {
> > > +		guc_update_engine_gt_clks(engine);
> > > +		guc_update_pm_timestamp(guc);
> > > +		intel_gt_pm_put_async(gt);
> > > +	}
> > > +
> > > +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> > > +	if (engine->stats.running) {
> > > +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> > > +
> > > +		total += intel_gt_clock_interval_to_ns(gt, clk);
> > > +	}
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > > +
> > > +	return ns_to_ktime(total);
> > > +}
> > > +
> > > +static void __update_guc_busyness_stats(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	struct intel_engine_cs *engine;
> > > +	enum intel_engine_id id;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	if (intel_gt_pm_get_if_awake(gt)) {
> > > +		guc_update_pm_timestamp(guc);
> > > +
> > > +		for_each_engine(engine, gt, id)
> > > +			guc_update_engine_gt_clks(engine);
> > > +
> > > +		intel_gt_pm_put_async(gt);
> > > +	}
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > > +}
> > > +
> > > +static void guc_timestamp_ping(struct work_struct *wrk)
> > > +{
> > > +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> > > +					     timestamp.work.work);
> > > +
> > > +	__update_guc_busyness_stats(guc);
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +}
> > > +
> > > +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> > > +{
> > > +	u32 offset = intel_guc_engine_usage_offset(guc);
> > > +	u32 action[] = {
> > > +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> > > +		offset,
> > > +		0,
> > > +	};
> > > +
> > > +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> > > +}
> > > +
> > > +static void guc_init_engine_stats(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	intel_wakeref_t wakeref;
> > > +
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +
> > > +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> > > +		int ret = guc_action_enable_usage_stats(guc);
> > > +
> > > +		if (ret)
> > > +			drm_err(&gt->i915->drm,
> > > +				"Failed to enable usage stats: %d!\n", ret);
> > > +	}
> > > +}
> > > +
> > > +void intel_guc_busyness_park(struct intel_gt *gt)
> > > +{
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +
> > > +	cancel_delayed_work(&guc->timestamp.work);
> > > +	__update_guc_busyness_stats(guc);
> > > +}
> > > +
> > > +void intel_guc_busyness_unpark(struct intel_gt *gt)
> > > +{
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +}
> > > +
> > >   /*
> > >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > >    * at firmware loading time.
> > >    */
> > >   int intel_guc_submission_init(struct intel_guc *guc)
> > >   {
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > >   	int ret;
> > >   	if (guc->lrc_desc_pool)
> > > @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > >   	INIT_LIST_HEAD(&guc->guc_id_list);
> > >   	ida_init(&guc->guc_ids);
> > > +	spin_lock_init(&guc->timestamp.lock);
> > > +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> > > +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> > > +
> > >   	return 0;
> > >   }
> > > @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
> > >   		engine->emit_flush = gen12_emit_flush_xcs;
> > >   	}
> > >   	engine->set_default_submission = guc_set_default_submission;
> > > +	engine->busyness = guc_engine_busyness;
> > > +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
> > >   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
> > >   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
> > > @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
> > >   void intel_guc_submission_enable(struct intel_guc *guc)
> > >   {
> > >   	guc_init_lrc_mapping(guc);
> > > +	guc_init_engine_stats(guc);
> > >   }
> > >   void intel_guc_submission_disable(struct intel_guc *guc)
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > index c7ef44fa0c36..5a95a9f0a8e3 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > >   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > >   				    struct i915_request *hung_rq,
> > >   				    struct drm_printer *m);
> > > +void intel_guc_busyness_park(struct intel_gt *gt);
> > > +void intel_guc_busyness_unpark(struct intel_gt *gt);
> > >   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
> > > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> > > index a897f4abea0c..9aee08425382 100644
> > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
> > >   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
> > >   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
> > > +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> > > +
> > >   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
> > >   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
> > >   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> > > -- 
> > > 2.20.1
> > > 


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-06 17:04       ` Matthew Brost
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2021-10-06 17:04 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Umesh Nerlige Ramappa, intel-gfx, dri-devel, john.c.harrison,
	daniel.vetter

On Wed, Oct 06, 2021 at 09:22:42AM +0100, Tvrtko Ursulin wrote:
> 
> On 06/10/2021 00:14, Matthew Brost wrote:
> > On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
> > > With GuC handling scheduling, i915 is not aware of the time that a
> > > context is scheduled in and out of the engine. Since i915 pmu relies on
> > > this info to provide engine busyness to the user, GuC shares this info
> > > with i915 for all engines using shared memory. For each engine, this
> > > info contains:
> > > 
> > > - total busyness: total time that the context was running (total)
> > > - id: id of the running context (id)
> > > - start timestamp: timestamp when the context started running (start)
> > > 
> > > At the time (now) of sampling the engine busyness, if the id is valid
> > > (!= ~0), and start is non-zero, then the context is considered to be
> > > active and the engine busyness is calculated using the below equation
> > > 
> > > 	engine busyness = total + (now - start)
> > > 
> > > All times are obtained from the gt clock base. For inactive contexts,
> > > engine busyness is just equal to the total.
> > > 
> > > The start and total values provided by GuC are 32 bits and wrap around
> > > in a few minutes. Since perf pmu provides busyness as 64 bit
> > > monotonically increasing values, there is a need for this implementation
> > > to account for overflows and extend the time to 64 bits before returning
> > > busyness to the user. In order to do that, a worker runs periodically at
> > > frequency = 1/8th the time it takes for the timestamp to wrap. As an
> > > example, that would be once in 27 seconds for a gt clock frequency of
> > > 19.2 MHz.
> > > 
> > > Opens and wip that are targeted for later patches:
> > > 
> > > 1) On global gt reset the total busyness of engines resets and i915
> > >     needs to fix that so that user sees monotonically increasing
> > >     busyness.
> > > 2) In runtime suspend mode, the worker may not need to be run. We could
> > >     stop the worker on suspend and rerun it on resume provided that the
> > >     guc pm timestamp does not tick during suspend.
> > > 
> > > Note:
> > > There might be an overaccounting of busyness due to the fact that GuC
> > > may be updating the total and start values while kmd is reading them.
> > > (i.e. kmd may read the updated total and the stale start). In such a
> > > case, user may see higher busyness value followed by smaller ones which
> > > would eventually catch up to the higher value.
> > > 
> > > v2: (Tvrtko)
> > > - Include details in commit message
> > > - Move intel engine busyness function into execlist code
> > > - Use union inside engine->stats
> > > - Use natural type for ping delay jiffies
> > > - Drop active_work condition checks
> > > - Use for_each_engine if iterating all engines
> > > - Drop seq locking, use spinlock at guc level to update engine stats
> > > - Document worker specific details
> > > 
> > > v3: (Tvrtko/Umesh)
> > > - Demarcate guc and execlist stat objects with comments
> > > - Document known over-accounting issue in commit
> > > - Provide a consistent view of guc state
> > > - Add hooks to gt park/unpark for guc busyness
> > > - Stop/start worker in gt park/unpark path
> > > - Drop inline
> > > - Move spinlock and worker inits to guc initialization
> > > - Drop helpers that are called only once
> > > 
> > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
> > >   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
> > >   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
> > >   drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
> > >   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
> > >   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
> > >   drivers/gpu/drm/i915/i915_reg.h               |   2 +
> > >   12 files changed, 398 insertions(+), 49 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > index 2ae57e4656a3..6fcc70a313d9 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> > > @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
> > >   	intel_engine_print_breadcrumbs(engine, m);
> > >   }
> > > -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> > > -					    ktime_t *now)
> > > -{
> > > -	ktime_t total = engine->stats.total;
> > > -
> > > -	/*
> > > -	 * If the engine is executing something at the moment
> > > -	 * add it to the total.
> > > -	 */
> > > -	*now = ktime_get();
> > > -	if (READ_ONCE(engine->stats.active))
> > > -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> > > -
> > > -	return total;
> > > -}
> > > -
> > >   /**
> > >    * intel_engine_get_busy_time() - Return current accumulated engine busyness
> > >    * @engine: engine to report on
> > > @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> > >    */
> > >   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
> > >   {
> > > -	unsigned int seq;
> > > -	ktime_t total;
> > > -
> > > -	do {
> > > -		seq = read_seqcount_begin(&engine->stats.lock);
> > > -		total = __intel_engine_get_busy_time(engine, now);
> > > -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> > > -
> > > -	return total;
> > > +	return engine->busyness(engine, now);
> > >   }
> > >   struct intel_context *
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > index 5ae1207c363b..8e1b9c38a6fc 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> > > @@ -432,6 +432,12 @@ struct intel_engine_cs {
> > >   	void		(*add_active_request)(struct i915_request *rq);
> > >   	void		(*remove_active_request)(struct i915_request *rq);
> > > +	/*
> > > +	 * Get engine busyness and the time at which the busyness was sampled.
> > > +	 */
> > > +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> > > +				    ktime_t *now);
> > > +
> > >   	struct intel_engine_execlists execlists;
> > >   	/*
> > > @@ -481,30 +487,66 @@ struct intel_engine_cs {
> > >   	u32 (*get_cmd_length_mask)(u32 cmd_header);
> > >   	struct {
> > > -		/**
> > > -		 * @active: Number of contexts currently scheduled in.
> > > -		 */
> > > -		unsigned int active;
> > > -
> > > -		/**
> > > -		 * @lock: Lock protecting the below fields.
> > > -		 */
> > > -		seqcount_t lock;
> > > -
> > > -		/**
> > > -		 * @total: Total time this engine was busy.
> > > -		 *
> > > -		 * Accumulated time not counting the most recent block in cases
> > > -		 * where engine is currently busy (active > 0).
> > > -		 */
> > > -		ktime_t total;
> > > -
> > > -		/**
> > > -		 * @start: Timestamp of the last idle to active transition.
> > > -		 *
> > > -		 * Idle is defined as active == 0, active is active > 0.
> > > -		 */
> > > -		ktime_t start;
> > > +		union {
> > > +			/* Fields used by the execlists backend. */
> > > +			struct {
> > > +				/**
> > > +				 * @active: Number of contexts currently
> > > +				 * scheduled in.
> > > +				 */
> > > +				unsigned int active;
> > > +
> > > +				/**
> > > +				 * @lock: Lock protecting the below fields.
> > > +				 */
> > > +				seqcount_t lock;
> > > +
> > > +				/**
> > > +				 * @total: Total time this engine was busy.
> > > +				 *
> > > +				 * Accumulated time not counting the most recent
> > > +				 * block in cases where engine is currently busy
> > > +				 * (active > 0).
> > > +				 */
> > > +				ktime_t total;
> > > +
> > > +				/**
> > > +				 * @start: Timestamp of the last idle to active
> > > +				 * transition.
> > > +				 *
> > > +				 * Idle is defined as active == 0, active is
> > > +				 * active > 0.
> > > +				 */
> > > +				ktime_t start;
> > > +			};
> > 
> > Not anonymous? e.g.
> > 
> > struct {
> > 	...
> > } execlists;
> > struct {
> > 	...
> > } guc;
> > 
> > IMO this is better as it is self-documenting, and if you touch a
> > backend-specific field in a non-backend-specific file it pops out as
> > incorrect.
> > 
> > > +
> > > +			/* Fields used by the GuC backend. */
> > > +			struct {
> > > +				/**
> > > +				 * @running: Active state of the engine when
> > > +				 * busyness was last sampled.
> > > +				 */
> > > +				bool running;
> > > +
> > > +				/**
> > > +				 * @prev_total: Previous value of total runtime
> > > +				 * clock cycles.
> > > +				 */
> > > +				u32 prev_total;
> > > +
> > > +				/**
> > > +				 * @total_gt_clks: Total gt clock cycles this
> > > +				 * engine was busy.
> > > +				 */
> > > +				u64 total_gt_clks;
> > > +
> > > +				/**
> > > +				 * @start_gt_clk: GT clock time of last idle to
> > > +				 * active transition.
> > > +				 */
> > > +				u64 start_gt_clk;
> > > +			};
> > > +		};
> > >   		/**
> > >   		 * @rps: Utilisation at last RPS sampling.
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > index 7147fe80919e..5c9b695e906c 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
> > >   	lrc_fini_wa_ctx(engine);
> > >   }
> > > +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> > > +					   ktime_t *now)
> > > +{
> > > +	ktime_t total = engine->stats.total;
> > > +
> > > +	/*
> > > +	 * If the engine is executing something at the moment
> > > +	 * add it to the total.
> > > +	 */
> > > +	*now = ktime_get();
> > > +	if (READ_ONCE(engine->stats.active))
> > > +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> > > +
> > > +	return total;
> > > +}
> > > +
> > > +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> > > +					 ktime_t *now)
> > > +{
> > > +	unsigned int seq;
> > > +	ktime_t total;
> > > +
> > > +	do {
> > > +		seq = read_seqcount_begin(&engine->stats.lock);
> > > +		total = __execlists_engine_busyness(engine, now);
> > > +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> > > +
> > > +	return total;
> > > +}
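The retry loop above is the usual seqcount reader pattern. Below is a minimal, single-threaded model with an even/odd generation counter; it is only a sketch, since the real read_seqcount_begin()/read_seqcount_retry() also issue the memory barriers this omits:

```c
#include <assert.h>
#include <stdint.h>

struct busy_stats {
	unsigned int seq;	/* even = stable, odd = writer active */
	uint64_t total;
};

/* Writer side: bump to odd, update, bump back to even. */
static void writer_update(struct busy_stats *s, uint64_t total)
{
	s->seq++;
	s->total = total;
	s->seq++;
}

/* Reader side: resample until a stable, even generation is seen
 * before and after reading the payload. */
static uint64_t reader_sample(const struct busy_stats *s)
{
	unsigned int seq;
	uint64_t total;

	do {
		seq = s->seq;			/* read_seqcount_begin() */
		total = s->total;
	} while ((seq & 1) || seq != s->seq);	/* read_seqcount_retry() */

	return total;
}
```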
> > > +
> > >   static void
> > >   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
> > >   {
> > > @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
> > >   		engine->emit_bb_start = gen8_emit_bb_start;
> > >   	else
> > >   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> > > +
> > > +	engine->busyness = execlists_engine_busyness;
> > >   }
> > >   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > index 524eaf678790..b4a8594bc46c 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > > @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
> > >   	intel_rc6_unpark(&gt->rc6);
> > >   	intel_rps_unpark(&gt->rps);
> > >   	i915_pmu_gt_unparked(i915);
> > > +	intel_guc_busyness_unpark(gt);
> > 
> > I personally don't mind this, but in the spirit of correct layering this
> > likely should be a generic wrapper inline func which calls a vfunc if
> > present (e.g. set the vfunc for the GuC backend, don't set it for
> > execlists).
> > 
> > >   	intel_gt_unpark_requests(gt);
> > >   	runtime_begin(gt);
> > > @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
> > >   	runtime_end(gt);
> > >   	intel_gt_park_requests(gt);
> > > +	intel_guc_busyness_park(gt);
> > 
> > Same here.
> > 
> > >   	i915_vma_parked(gt);
> > >   	i915_pmu_gt_parked(i915);
> > >   	intel_rps_park(&gt->rps);
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > index 8ff582222aff..ff1311d4beff 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> > > @@ -143,6 +143,7 @@ enum intel_guc_action {
> > >   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
> > >   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
> > >   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> > > +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
> > >   	INTEL_GUC_ACTION_LIMIT
> > >   };
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 5dd174babf7a..22c30dbdf63a 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -104,6 +104,8 @@ struct intel_guc {
> > >   	u32 ads_regset_size;
> > >   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
> > >   	u32 ads_golden_ctxt_size;
> > > +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> > > +	u32 ads_engine_usage_size;
> > >   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
> > >   	struct i915_vma *lrc_desc_pool;
> > > @@ -138,6 +140,30 @@ struct intel_guc {
> > >   	/** @send_mutex: used to serialize the intel_guc_send actions */
> > >   	struct mutex send_mutex;
> > > +
> > > +	struct {
> > > +		/**
> > > +		 * @lock: Lock protecting the below fields and the engine stats.
> > > +		 */
> > > +		spinlock_t lock;
> > > +
> > 
> > Again I really don't mind but I'm told not to add more spin locks than
> > needed. This really should be protected by a generic GuC submission spin
> > lock. e.g. Build on this patch and protect all of this by the
> > submission_state.lock.
> 
> I see no good reason to use the submission lock here. The two are completely
> different paths, with completely different entry points and we don't want to
> introduce contention where it is trivially avoidable for no real cost. In
> other words I think this lock is well defined and localised both in code and
> in execution flows.
> 

The direction from architecture is to use as few locks as possible; not
saying I agree, just passing along the top-down direction.

Matt

> Regards,
> 
> Tvrtko
> 
> > 
> > https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5
> > 
> > Whomevers series gets merged first can include the above patch.
> > 
> > Rest the series looks fine cosmetically to me.
> > 
> > Matt
> > 
> > > +		/**
> > > +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> > > +		 */
> > > +		u64 gt_stamp;
> > > +
> > > +		/**
> > > +		 * @ping_delay: Period for polling the GT timestamp for
> > > +		 * overflow.
> > > +		 */
> > > +		unsigned long ping_delay;
> > > +
> > > +		/**
> > > +		 * @work: Periodic work to adjust GT timestamp, engine and
> > > +		 * context usage for overflows.
> > > +		 */
> > > +		struct delayed_work work;
> > > +	} timestamp;
> > >   };
> > >   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > index 2c6ea64af7ec..ca9ab53999d5 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> > > @@ -26,6 +26,8 @@
> > >    *      | guc_policies                          |
> > >    *      +---------------------------------------+
> > >    *      | guc_gt_system_info                    |
> > > + *      +---------------------------------------+
> > > + *      | guc_engine_usage                      |
> > >    *      +---------------------------------------+ <== static
> > >    *      | guc_mmio_reg[countA] (engine 0.0)     |
> > >    *      | guc_mmio_reg[countB] (engine 0.1)     |
> > > @@ -47,6 +49,7 @@ struct __guc_ads_blob {
> > >   	struct guc_ads ads;
> > >   	struct guc_policies policies;
> > >   	struct guc_gt_system_info system_info;
> > > +	struct guc_engine_usage engine_usage;
> > >   	/* From here on, location is dynamic! Refer to above diagram. */
> > >   	struct guc_mmio_reg regset[0];
> > >   } __packed;
> > > @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
> > >   	guc_ads_private_data_reset(guc);
> > >   }
> > > +
> > > +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> > > +{
> > > +	struct __guc_ads_blob *blob = guc->ads_blob;
> > > +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> > > +	u32 offset = base + ptr_offset(blob, engine_usage);
> > > +
> > > +	return offset;
> > > +}
> > > +
> > > +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> > > +{
> > > +	struct intel_guc *guc = &engine->gt->uc.guc;
> > > +	struct __guc_ads_blob *blob = guc->ads_blob;
> > > +	u8 guc_class = engine_class_to_guc_class(engine->class);
> > > +
> > > +	return &blob->engine_usage.engines[guc_class][engine->instance];
> > > +}
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > index 3d85051d57e4..e74c110facff 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> > > @@ -6,8 +6,11 @@
> > >   #ifndef _INTEL_GUC_ADS_H_
> > >   #define _INTEL_GUC_ADS_H_
> > > +#include <linux/types.h>
> > > +
> > >   struct intel_guc;
> > >   struct drm_printer;
> > > +struct intel_engine_cs;
> > >   int intel_guc_ads_create(struct intel_guc *guc);
> > >   void intel_guc_ads_destroy(struct intel_guc *guc);
> > > @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
> > >   void intel_guc_ads_reset(struct intel_guc *guc);
> > >   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
> > >   				     struct drm_printer *p);
> > > +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> > > +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
> > >   #endif
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > index fa4be13c8854..7c9c081670fc 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > @@ -294,6 +294,19 @@ struct guc_ads {
> > >   	u32 reserved[15];
> > >   } __packed;
> > > +/* Engine usage stats */
> > > +struct guc_engine_usage_record {
> > > +	u32 current_context_index;
> > > +	u32 last_switch_in_stamp;
> > > +	u32 reserved0;
> > > +	u32 total_runtime;
> > > +	u32 reserved1[4];
> > > +} __packed;
> > > +
> > > +struct guc_engine_usage {
> > > +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> > > +} __packed;
> > > +
> > >   /* GuC logging structures */
> > >   enum guc_log_buffer_type {
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index ba0de35f6323..3f7d0f2ac9da 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -12,6 +12,7 @@
> > >   #include "gt/intel_engine_pm.h"
> > >   #include "gt/intel_engine_heartbeat.h"
> > >   #include "gt/intel_gt.h"
> > > +#include "gt/intel_gt_clock_utils.h"
> > >   #include "gt/intel_gt_irq.h"
> > >   #include "gt/intel_gt_pm.h"
> > >   #include "gt/intel_gt_requests.h"
> > > @@ -20,6 +21,7 @@
> > >   #include "gt/intel_mocs.h"
> > >   #include "gt/intel_ring.h"
> > > +#include "intel_guc_ads.h"
> > >   #include "intel_guc_submission.h"
> > >   #include "i915_drv.h"
> > > @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
> > >   static void disable_submission(struct intel_guc *guc)
> > >   {
> > >   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	struct intel_engine_cs *engine;
> > > +	enum intel_engine_id id;
> > > +	unsigned long flags;
> > >   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > >   		GEM_BUG_ON(!guc->ct.enabled);
> > >   		__tasklet_disable_sync_once(&sched_engine->tasklet);
> > >   		sched_engine->tasklet.callback = NULL;
> > >   	}
> > > +
> > > +	cancel_delayed_work(&guc->timestamp.work);
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	for_each_engine(engine, gt, id)
> > > +		engine->stats.prev_total = 0;
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > >   }
> > >   static void enable_submission(struct intel_guc *guc)
> > > @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
> > >   	intel_gt_unpark_heartbeats(guc_to_gt(guc));
> > >   }
> > > +/*
> > > + * GuC stores busyness stats for each engine at context in/out boundaries. A
> > > + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> > > + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> > > + * GuC.
> > > + *
> > > + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> > > + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> > > + * active. For an active engine total busyness = total + (now - start), where
> > > + * 'now' is the time at which the busyness is sampled. For inactive engine,
> > > + * total busyness = total.
> > > + *
> > > + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> > > + *
> > > + * The start and total values provided by GuC are 32 bits and wrap around in a
> > > + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> > > + * increasing ns values, there is a need for this implementation to account for
> > > + * overflows and extend the GuC provided values to 64 bits before returning
> > > + * busyness to the user. In order to do that, a worker runs periodically at
> > > + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> > > + * 27 seconds for a gt clock frequency of 19.2 MHz).
> > > + */
> > > +
> > > +#define WRAP_TIME_CLKS U32_MAX
> > > +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> > > +
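To make these numbers concrete: U32_MAX >> 3 cycles of a 19.2 MHz clock is just under 28 seconds, and the full 32-bit wrap takes about 224 seconds, which roughly matches the "few minutes" and "once in 27 seconds" figures in the commit message. A standalone sketch of the same arithmetic, expressed in seconds rather than jiffies (it mirrors the ping_delay computation in guc_init_engine_stats further down):

```c
#include <assert.h>
#include <stdint.h>

#define WRAP_TIME_CLKS UINT32_MAX
#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)

/* Poll period in whole seconds, rounded up with the same "+ 1" used
 * for the jiffies-based ping_delay in the patch. */
static uint32_t ping_delay_secs(uint32_t gt_clock_hz)
{
	return POLL_TIME_CLKS / gt_clock_hz + 1;
}
```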
> > > +static void
> > > +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> > > +{
> > > +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> > > +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> > > +
> > > +	if (new_start == lower_32_bits(*prev_start))
> > > +		return;
> > > +
> > > +	if (new_start < gt_stamp_last &&
> > > +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> > > +		gt_stamp_hi++;
> > > +
> > > +	if (new_start > gt_stamp_last &&
> > > +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
> > > +		gt_stamp_hi--;
> > > +
> > > +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> > > +}
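Since the two-sided window test above is the subtle part, here is a userspace model of it (illustrative, not the driver function). It assumes new_start and the extended gt_stamp never drift apart by more than POLL_TIME_CLKS, which is exactly what the ping worker guarantees:

```c
#include <assert.h>
#include <stdint.h>

#define POLL_TIME_CLKS (UINT32_MAX >> 3)

/* Extend a fresh 32-bit context-switch stamp to 64 bits against the
 * already-extended gt timestamp, bumping the upper half up or down
 * when exactly one of the two stamps has crossed a 32-bit wrap. */
static void extend_last_switch(uint64_t gt_stamp, uint64_t *prev_start,
			       uint32_t new_start)
{
	uint32_t hi = (uint32_t)(gt_stamp >> 32);
	uint32_t last = (uint32_t)gt_stamp;

	if (new_start == (uint32_t)*prev_start)
		return;

	/* new_start wrapped past zero but gt_stamp has not yet */
	if (new_start < last && (uint32_t)(new_start - last) <= POLL_TIME_CLKS)
		hi++;

	/* gt_stamp wrapped but new_start predates the wrap */
	if (new_start > last && (uint32_t)(last - new_start) <= POLL_TIME_CLKS && hi)
		hi--;

	*prev_start = ((uint64_t)hi << 32) | new_start;
}
```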
> > > +
> > > +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> > > +{
> > > +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> > > +	struct intel_guc *guc = &engine->gt->uc.guc;
> > > +	u32 last_switch = rec->last_switch_in_stamp;
> > > +	u32 ctx_id = rec->current_context_index;
> > > +	u32 total = rec->total_runtime;
> > > +
> > > +	lockdep_assert_held(&guc->timestamp.lock);
> > > +
> > > +	engine->stats.running = ctx_id != ~0U && last_switch;
> > > +	if (engine->stats.running)
> > > +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
> > > +				     last_switch);
> > > +
> > > +	/*
> > > +	 * Instead of adjusting the total for overflow, just add the
> > > +	 * difference from previous sample to the stats.total_gt_clks
> > > +	 */
> > > +	if (total && total != ~0U) {
> > > +		engine->stats.total_gt_clks += (u32)(total -
> > > +						     engine->stats.prev_total);
> > > +		engine->stats.prev_total = total;
> > > +	}
> > > +}
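The delta accumulation can likewise be sanity-checked with a small model (illustrative struct, not engine->stats itself). The point is that u32 subtraction against the previous sample yields the correct delta even across a 32-bit wrap, provided a sample is taken at least once per wrap period:

```c
#include <assert.h>
#include <stdint.h>

struct eng_stats_model {
	uint32_t prev_total;	/* last raw 32-bit total from GuC */
	uint64_t total_gt_clks;	/* monotonically extended total */
};

/* Fold a new raw total into the 64-bit accumulator; 0 and ~0 are
 * treated as "no valid data", as in the patch. */
static void accumulate_total(struct eng_stats_model *st, uint32_t total)
{
	if (total && total != ~0u) {
		st->total_gt_clks += (uint32_t)(total - st->prev_total);
		st->prev_total = total;
	}
}
```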
> > > +
> > > +static void guc_update_pm_timestamp(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	u32 gt_stamp_now, gt_stamp_hi;
> > > +
> > > +	lockdep_assert_held(&guc->timestamp.lock);
> > > +
> > > +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> > > +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> > > +
> > > +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
> > > +		gt_stamp_hi++;
> > > +
> > > +	guc->timestamp.gt_stamp = ((u64)gt_stamp_hi << 32) | gt_stamp_now;
> > > +}
> > > +
> > > +/*
> > > + * Unlike the execlist mode of submission, total and active times are in terms of
> > > + * gt clocks. The *now parameter is retained to return the cpu time at which the
> > > + * busyness was sampled.
> > > + */
> > > +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> > > +{
> > > +	struct intel_gt *gt = engine->gt;
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +	unsigned long flags;
> > > +	u64 total;
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	*now = ktime_get();
> > > +
> > > +	/*
> > > +	 * The active busyness depends on start_gt_clk and gt_stamp.
> > > +	 * gt_stamp is updated by i915 only when gt is awake and the
> > > +	 * start_gt_clk is derived from GuC state. To get a consistent
> > > +	 * view of activity, we query the GuC state only if gt is awake.
> > > +	 */
> > > +	if (intel_gt_pm_get_if_awake(gt)) {
> > > +		guc_update_engine_gt_clks(engine);
> > > +		guc_update_pm_timestamp(guc);
> > > +		intel_gt_pm_put_async(gt);
> > > +	}
> > > +
> > > +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> > > +	if (engine->stats.running) {
> > > +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> > > +
> > > +		total += intel_gt_clock_interval_to_ns(gt, clk);
> > > +	}
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > > +
> > > +	return ns_to_ktime(total);
> > > +}
> > > +
> > > +static void __update_guc_busyness_stats(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	struct intel_engine_cs *engine;
> > > +	enum intel_engine_id id;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > > +
> > > +	if (intel_gt_pm_get_if_awake(gt)) {
> > > +		guc_update_pm_timestamp(guc);
> > > +
> > > +		for_each_engine(engine, gt, id)
> > > +			guc_update_engine_gt_clks(engine);
> > > +
> > > +		intel_gt_pm_put_async(gt);
> > > +	}
> > > +
> > > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > > +}
> > > +
> > > +static void guc_timestamp_ping(struct work_struct *wrk)
> > > +{
> > > +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> > > +					     timestamp.work.work);
> > > +
> > > +	__update_guc_busyness_stats(guc);
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +}
> > > +
> > > +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> > > +{
> > > +	u32 offset = intel_guc_engine_usage_offset(guc);
> > > +	u32 action[] = {
> > > +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> > > +		offset,
> > > +		0,
> > > +	};
> > > +
> > > +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> > > +}
> > > +
> > > +static void guc_init_engine_stats(struct intel_guc *guc)
> > > +{
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > > +	intel_wakeref_t wakeref;
> > > +
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +
> > > +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> > > +		int ret = guc_action_enable_usage_stats(guc);
> > > +
> > > +		if (ret)
> > > +			drm_err(&gt->i915->drm,
> > > +				"Failed to enable usage stats: %d!\n", ret);
> > > +	}
> > > +}
> > > +
> > > +void intel_guc_busyness_park(struct intel_gt *gt)
> > > +{
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +
> > > +	cancel_delayed_work(&guc->timestamp.work);
> > > +	__update_guc_busyness_stats(guc);
> > > +}
> > > +
> > > +void intel_guc_busyness_unpark(struct intel_gt *gt)
> > > +{
> > > +	struct intel_guc *guc = &gt->uc.guc;
> > > +
> > > +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
> > > +			 guc->timestamp.ping_delay);
> > > +}
> > > +
> > >   /*
> > >    * Set up the memory resources to be shared with the GuC (via the GGTT)
> > >    * at firmware loading time.
> > >    */
> > >   int intel_guc_submission_init(struct intel_guc *guc)
> > >   {
> > > +	struct intel_gt *gt = guc_to_gt(guc);
> > >   	int ret;
> > >   	if (guc->lrc_desc_pool)
> > > @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
> > >   	INIT_LIST_HEAD(&guc->guc_id_list);
> > >   	ida_init(&guc->guc_ids);
> > > +	spin_lock_init(&guc->timestamp.lock);
> > > +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> > > +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> > > +
> > >   	return 0;
> > >   }
> > > @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
> > >   		engine->emit_flush = gen12_emit_flush_xcs;
> > >   	}
> > >   	engine->set_default_submission = guc_set_default_submission;
> > > +	engine->busyness = guc_engine_busyness;
> > > +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
> > >   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
> > >   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
> > > @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
> > >   void intel_guc_submission_enable(struct intel_guc *guc)
> > >   {
> > >   	guc_init_lrc_mapping(guc);
> > > +	guc_init_engine_stats(guc);
> > >   }
> > >   void intel_guc_submission_disable(struct intel_guc *guc)
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > index c7ef44fa0c36..5a95a9f0a8e3 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
> > > @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > >   void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
> > >   				    struct i915_request *hung_rq,
> > >   				    struct drm_printer *m);
> > > +void intel_guc_busyness_park(struct intel_gt *gt);
> > > +void intel_guc_busyness_unpark(struct intel_gt *gt);
> > >   bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
> > > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> > > index a897f4abea0c..9aee08425382 100644
> > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
> > >   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
> > >   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
> > > +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> > > +
> > >   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
> > >   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
> > >   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> > > -- 
> > > 2.20.1
> > > 


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-06  9:11   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-10-06 20:45     ` Umesh Nerlige Ramappa
  -1 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-06 20:45 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter, Matthew Brost

On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:
>
>On 05/10/2021 18:47, Umesh Nerlige Ramappa wrote:
>>With GuC handling scheduling, i915 is not aware of the time that a
>>context is scheduled in and out of the engine. Since i915 pmu relies on
>>this info to provide engine busyness to the user, GuC shares this info
>>with i915 for all engines using shared memory. For each engine, this
>>info contains:
>>
>>- total busyness: total time that the context was running (total)
>>- id: id of the running context (id)
>>- start timestamp: timestamp when the context started running (start)
>>
>>At the time (now) of sampling the engine busyness, if the id is valid
>>(!= ~0), and start is non-zero, then the context is considered to be
>>active and the engine busyness is calculated using the below equation
>>
>>	engine busyness = total + (now - start)
>>
>>All times are obtained from the gt clock base. For inactive contexts,
>>engine busyness is just equal to the total.
>>
>>The start and total values provided by GuC are 32 bits and wrap around
>>in a few minutes. Since perf pmu provides busyness as 64 bit
>>monotonically increasing values, there is a need for this implementation
>>to account for overflows and extend the time to 64 bits before returning
>>busyness to the user. In order to do that, a worker runs periodically at
>>frequency = 1/8th the time it takes for the timestamp to wrap. As an
>>example, that would be once in 27 seconds for a gt clock frequency of
>>19.2 MHz.
>>
>>Opens and wip that are targeted for later patches:
>>
>>1) On global gt reset the total busyness of engines resets and i915
>>    needs to fix that so that user sees monotonically increasing
>>    busyness.
>>2) In runtime suspend mode, the worker may not need to be run. We could
>>    stop the worker on suspend and rerun it on resume provided that the
>>    guc pm timestamp does not tick during suspend.
>
>Second point had now been addressed, right?

Both were addressed, actually. For the reset case I was mainly running 
busy-hang, and after adding your suggestion of maintaining a consistent 
view, the busy-hang is fixed too.

I will remove them from the commit msg.

>
>>
>>Note:
>>There might be an overaccounting of busyness due to the fact that GuC
>>may be updating the total and start values while kmd is reading them.
>>(i.e kmd may read the updated total and the stale start). In such a
>>case, user may see higher busyness value followed by smaller ones which
>>would eventually catch up to the higher value.
>>
>>v2: (Tvrtko)
>>- Include details in commit message
>>- Move intel engine busyness function into execlist code
>>- Use union inside engine->stats
>>- Use natural type for ping delay jiffies
>>- Drop active_work condition checks
>>- Use for_each_engine if iterating all engines
>>- Drop seq locking, use spinlock at guc level to update engine stats
>>- Document worker specific details
>>
>>v3: (Tvrtko/Umesh)
>>- Demarcate guc and execlist stat objects with comments
>>- Document known over-accounting issue in commit
>>- Provide a consistent view of guc state
>>- Add hooks to gt park/unpark for guc busyness
>>- Stop/start worker in gt park/unpark path
>>- Drop inline
>>- Move spinlock and worker inits to guc initialization
>>- Drop helpers that are called only once
>>
>>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>  12 files changed, 398 insertions(+), 49 deletions(-)
>>
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>index 2ae57e4656a3..6fcc70a313d9 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>  	intel_engine_print_breadcrumbs(engine, m);
>>  }
>>-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>-					    ktime_t *now)
>>-{
>>-	ktime_t total = engine->stats.total;
>>-
>>-	/*
>>-	 * If the engine is executing something at the moment
>>-	 * add it to the total.
>>-	 */
>>-	*now = ktime_get();
>>-	if (READ_ONCE(engine->stats.active))
>>-		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>-
>>-	return total;
>>-}
>>-
>>  /**
>>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>   * @engine: engine to report on
>>@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>   */
>>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>  {
>>-	unsigned int seq;
>>-	ktime_t total;
>>-
>>-	do {
>>-		seq = read_seqcount_begin(&engine->stats.lock);
>>-		total = __intel_engine_get_busy_time(engine, now);
>>-	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>-
>>-	return total;
>>+	return engine->busyness(engine, now);
>>  }
>>  struct intel_context *
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>index 5ae1207c363b..8e1b9c38a6fc 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>@@ -432,6 +432,12 @@ struct intel_engine_cs {
>>  	void		(*add_active_request)(struct i915_request *rq);
>>  	void		(*remove_active_request)(struct i915_request *rq);
>>+	/*
>>+	 * Get engine busyness and the time at which the busyness was sampled.
>>+	 */
>>+	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>>+				    ktime_t *now);
>>+
>>  	struct intel_engine_execlists execlists;
>>  	/*
>>@@ -481,30 +487,66 @@ struct intel_engine_cs {
>>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>  	struct {
>>-		/**
>>-		 * @active: Number of contexts currently scheduled in.
>>-		 */
>>-		unsigned int active;
>>-
>>-		/**
>>-		 * @lock: Lock protecting the below fields.
>>-		 */
>>-		seqcount_t lock;
>>-
>>-		/**
>>-		 * @total: Total time this engine was busy.
>>-		 *
>>-		 * Accumulated time not counting the most recent block in cases
>>-		 * where engine is currently busy (active > 0).
>>-		 */
>>-		ktime_t total;
>>-
>>-		/**
>>-		 * @start: Timestamp of the last idle to active transition.
>>-		 *
>>-		 * Idle is defined as active == 0, active is active > 0.
>>-		 */
>>-		ktime_t start;
>>+		union {
>>+			/* Fields used by the execlists backend. */
>>+			struct {
>>+				/**
>>+				 * @active: Number of contexts currently
>>+				 * scheduled in.
>>+				 */
>>+				unsigned int active;
>>+
>>+				/**
>>+				 * @lock: Lock protecting the below fields.
>>+				 */
>>+				seqcount_t lock;
>>+
>>+				/**
>>+				 * @total: Total time this engine was busy.
>>+				 *
>>+				 * Accumulated time not counting the most recent
>>+				 * block in cases where engine is currently busy
>>+				 * (active > 0).
>>+				 */
>>+				ktime_t total;
>>+
>>+				/**
>>+				 * @start: Timestamp of the last idle to active
>>+				 * transition.
>>+				 *
>>+				 * Idle is defined as active == 0, active is
>>+				 * active > 0.
>>+				 */
>>+				ktime_t start;
>>+			};
>>+
>>+			/* Fields used by the GuC backend. */
>>+			struct {
>>+				/**
>>+				 * @running: Active state of the engine when
>>+				 * busyness was last sampled.
>>+				 */
>>+				bool running;
>>+
>>+				/**
>>+				 * @prev_total: Previous value of total runtime
>>+				 * clock cycles.
>>+				 */
>>+				u32 prev_total;
>>+
>>+				/**
>>+				 * @total_gt_clks: Total gt clock cycles this
>>+				 * engine was busy.
>>+				 */
>>+				u64 total_gt_clks;
>>+
>>+				/**
>>+				 * @start_gt_clk: GT clock time of last idle to
>>+				 * active transition.
>>+				 */
>>+				u64 start_gt_clk;
>>+			};
>>+		};
>>  		/**
>>  		 * @rps: Utilisation at last RPS sampling.
>>diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>index 7147fe80919e..5c9b695e906c 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>@@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>  	lrc_fini_wa_ctx(engine);
>>  }
>>+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					   ktime_t *now)
>>+{
>>+	ktime_t total = engine->stats.total;
>>+
>>+	/*
>>+	 * If the engine is executing something at the moment
>>+	 * add it to the total.
>>+	 */
>>+	*now = ktime_get();
>>+	if (READ_ONCE(engine->stats.active))
>>+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>+
>>+	return total;
>>+}
>>+
>>+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					 ktime_t *now)
>>+{
>>+	unsigned int seq;
>>+	ktime_t total;
>>+
>>+	do {
>>+		seq = read_seqcount_begin(&engine->stats.lock);
>>+		total = __execlists_engine_busyness(engine, now);
>>+	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>+
>>+	return total;
>>+}
>>+
>>  static void
>>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  {
>>@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_bb_start = gen8_emit_bb_start;
>>  	else
>>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>>+
>>+	engine->busyness = execlists_engine_busyness;
>>  }
>>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>>diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>index 524eaf678790..b4a8594bc46c 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>@@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>  	intel_rc6_unpark(&gt->rc6);
>>  	intel_rps_unpark(&gt->rps);
>>  	i915_pmu_gt_unparked(i915);
>>+	intel_guc_busyness_unpark(gt);
>>  	intel_gt_unpark_requests(gt);
>>  	runtime_begin(gt);
>>@@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>  	runtime_end(gt);
>>  	intel_gt_park_requests(gt);
>>+	intel_guc_busyness_park(gt);
>>  	i915_vma_parked(gt);
>>  	i915_pmu_gt_parked(i915);
>>  	intel_rps_park(&gt->rps);
>>diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>index 8ff582222aff..ff1311d4beff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>@@ -143,6 +143,7 @@ enum intel_guc_action {
>>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>>+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>  	INTEL_GUC_ACTION_LIMIT
>>  };
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>index 5dd174babf7a..22c30dbdf63a 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>@@ -104,6 +104,8 @@ struct intel_guc {
>>  	u32 ads_regset_size;
>>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>  	u32 ads_golden_ctxt_size;
>>+	/** @ads_engine_usage_size: size of engine usage in the ADS */
>>+	u32 ads_engine_usage_size;
>>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>  	struct i915_vma *lrc_desc_pool;
>>@@ -138,6 +140,30 @@ struct intel_guc {
>>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>>  	struct mutex send_mutex;
>>+
>>+	struct {
>>+		/**
>>+		 * @lock: Lock protecting the below fields and the engine stats.
>>+		 */
>>+		spinlock_t lock;
>>+
>>+		/**
>>+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>>+		 */
>>+		u64 gt_stamp;
>>+
>>+		/**
>>+		 * @ping_delay: Period for polling the GT timestamp for
>>+		 * overflow.
>>+		 */
>>+		unsigned long ping_delay;
>>+
>>+		/**
>>+		 * @work: Periodic work to adjust GT timestamp, engine and
>>+		 * context usage for overflows.
>>+		 */
>>+		struct delayed_work work;
>>+	} timestamp;
>>  };
>>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>index 2c6ea64af7ec..ca9ab53999d5 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>@@ -26,6 +26,8 @@
>>   *      | guc_policies                          |
>>   *      +---------------------------------------+
>>   *      | guc_gt_system_info                    |
>>+ *      +---------------------------------------+
>>+ *      | guc_engine_usage                      |
>>   *      +---------------------------------------+ <== static
>>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>>   *      | guc_mmio_reg[countB] (engine 0.1)     |
>>@@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>  	struct guc_ads ads;
>>  	struct guc_policies policies;
>>  	struct guc_gt_system_info system_info;
>>+	struct guc_engine_usage engine_usage;
>>  	/* From here on, location is dynamic! Refer to above diagram. */
>>  	struct guc_mmio_reg regset[0];
>>  } __packed;
>>@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>  	guc_ads_private_data_reset(guc);
>>  }
>>+
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>>+{
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>>+	u32 offset = base + ptr_offset(blob, engine_usage);
>>+
>>+	return offset;
>>+}
>>+
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>>+{
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u8 guc_class = engine_class_to_guc_class(engine->class);
>>+
>>+	return &blob->engine_usage.engines[guc_class][engine->instance];
>>+}
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>index 3d85051d57e4..e74c110facff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>@@ -6,8 +6,11 @@
>>  #ifndef _INTEL_GUC_ADS_H_
>>  #define _INTEL_GUC_ADS_H_
>>+#include <linux/types.h>
>>+
>>  struct intel_guc;
>>  struct drm_printer;
>>+struct intel_engine_cs;
>>  int intel_guc_ads_create(struct intel_guc *guc);
>>  void intel_guc_ads_destroy(struct intel_guc *guc);
>>@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>  void intel_guc_ads_reset(struct intel_guc *guc);
>>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>  				     struct drm_printer *p);
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>  #endif
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>index fa4be13c8854..7c9c081670fc 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>@@ -294,6 +294,19 @@ struct guc_ads {
>>  	u32 reserved[15];
>>  } __packed;
>>+/* Engine usage stats */
>>+struct guc_engine_usage_record {
>>+	u32 current_context_index;
>>+	u32 last_switch_in_stamp;
>>+	u32 reserved0;
>>+	u32 total_runtime;
>>+	u32 reserved1[4];
>>+} __packed;
>>+
>>+struct guc_engine_usage {
>>+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>>+} __packed;
>>+
>>  /* GuC logging structures */
>>  enum guc_log_buffer_type {
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>index ba0de35f6323..3f7d0f2ac9da 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>@@ -12,6 +12,7 @@
>>  #include "gt/intel_engine_pm.h"
>>  #include "gt/intel_engine_heartbeat.h"
>>  #include "gt/intel_gt.h"
>>+#include "gt/intel_gt_clock_utils.h"
>>  #include "gt/intel_gt_irq.h"
>>  #include "gt/intel_gt_pm.h"
>>  #include "gt/intel_gt_requests.h"
>>@@ -20,6 +21,7 @@
>>  #include "gt/intel_mocs.h"
>>  #include "gt/intel_ring.h"
>>+#include "intel_guc_ads.h"
>>  #include "intel_guc_submission.h"
>>  #include "i915_drv.h"
>>@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>  static void disable_submission(struct intel_guc *guc)
>>  {
>>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	enum intel_engine_id id;
>>+	unsigned long flags;
>>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>  		GEM_BUG_ON(!guc->ct.enabled);
>>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>  		sched_engine->tasklet.callback = NULL;
>>  	}
>>+
>>+	cancel_delayed_work(&guc->timestamp.work);
>
>I am not sure when disable_submission gets called so a question - 
>could it be important to call cancel_delayed_work_sync here to ensure 
>if the worker was running it had exited before proceeding?

disable_submission is called in the reset_prepare path for uc resets. I 
see this happening only with the busy-hang test, which does a global gt 
reset. The counterpart for this is guc_init_engine_stats, which is 
called post reset in the path that reinitializes GuC.

I tried cancel_delayed_work_sync both here and in park. Seems to work 
fine, so I will change the calls to the _sync versions.

>
>Also, does this interact with the open about resets? Should/could 
>parking helper be called from here?

It is related to reset. Below, I am only resetting the engine prev_total 
to 0 since the GuC counter restarts on gt reset. I thought that's all we 
need to keep the busyness increasing monotonically. By calling the 
parking helper, are you suggesting we should update the other stats too 
(total, start, gt_stamp etc.)?

>
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	for_each_engine(engine, gt, id)
>>+		engine->stats.prev_total = 0;
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>  static void enable_submission(struct intel_guc *guc)
>>@@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>  }
>>+/*
>>+ * GuC stores busyness stats for each engine at context in/out boundaries. A
>>+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>>+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>>+ * GuC.
>>+ *
>>+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>>+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
>>+ * active. For an active engine total busyness = total + (now - start), where
>>+ * 'now' is the time at which the busyness is sampled. For inactive engine,
>>+ * total busyness = total.
>>+ *
>>+ * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>>+ *
>>+ * The start and total values provided by GuC are 32 bits and wrap around in a
>>+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>>+ * increasing ns values, there is a need for this implementation to account for
>>+ * overflows and extend the GuC provided values to 64 bits before returning
>>+ * busyness to the user. In order to do that, a worker runs periodically at
>>+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>>+ * 27 seconds for a gt clock frequency of 19.2 MHz).
>>+ */
>>+
>>+#define WRAP_TIME_CLKS U32_MAX
>>+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>>+
>>+static void
>>+__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>>+{
>>+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>>+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>>+
>>+	if (new_start == lower_32_bits(*prev_start))
>>+		return;
>>+
>>+	if (new_start < gt_stamp_last &&
>>+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>>+		gt_stamp_hi++;
>>+
>>+	if (new_start > gt_stamp_last &&
>>+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>>+		gt_stamp_hi--;
>>+
>>+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>>+}
>>+
>>+static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>>+{
>>+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	u32 last_switch = rec->last_switch_in_stamp;
>>+	u32 ctx_id = rec->current_context_index;
>>+	u32 total = rec->total_runtime;
>>+
>>+	lockdep_assert_held(&guc->timestamp.lock);
>>+
>>+	engine->stats.running = ctx_id != ~0U && last_switch;
>>+	if (engine->stats.running)
>>+		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>>+				     last_switch);
>>+
>>+	/*
>>+	 * Instead of adjusting the total for overflow, just add the
>>+	 * difference from previous sample to the stats.total_gt_clks
>>+	 */
>>+	if (total && total != ~0U) {
>>+		engine->stats.total_gt_clks += (u32)(total -
>>+						     engine->stats.prev_total);
>>+		engine->stats.prev_total = total;
>>+	}
>>+}
>>+
>>+static void guc_update_pm_timestamp(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	u32 gt_stamp_now, gt_stamp_hi;
>>+
>>+	lockdep_assert_held(&guc->timestamp.lock);
>>+
>>+	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>>+	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>>+
>>+	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>>+		gt_stamp_hi++;
>>+
>>+	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>>+}
>>+
>>+/*
>>+ * Unlike the execlist mode of submission total and active times are in terms of
>>+ * gt clocks. The *now parameter is retained to return the cpu time at which the
>>+ * busyness was sampled.
>>+ */
>>+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>+{
>>+	struct intel_gt *gt = engine->gt;
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+	unsigned long flags;
>>+	u64 total;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	*now = ktime_get();
>>+
>>+	/*
>>+	 * The active busyness depends on start_gt_clk and gt_stamp.
>>+	 * gt_stamp is updated by i915 only when gt is awake and the
>>+	 * start_gt_clk is derived from GuC state. To get a consistent
>>+	 * view of activity, we query the GuC state only if gt is awake.
>>+	 */
>>+	if (intel_gt_pm_get_if_awake(gt)) {
>>+		guc_update_engine_gt_clks(engine);
>>+		guc_update_pm_timestamp(guc);
>>+		intel_gt_pm_put_async(gt);
>>+	}
>>+
>>+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>>+	if (engine->stats.running) {
>>+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>>+
>>+		total += intel_gt_clock_interval_to_ns(gt, clk);
>>+	}
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+
>>+	return ns_to_ktime(total);
>>+}
>>+
>>+static void __update_guc_busyness_stats(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	enum intel_engine_id id;
>>+	unsigned long flags;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	if (intel_gt_pm_get_if_awake(gt)) {
>>+		guc_update_pm_timestamp(guc);
>>+
>>+		for_each_engine(engine, gt, id)
>>+			guc_update_engine_gt_clks(engine);
>>+
>>+		intel_gt_pm_put_async(gt);
>>+	}
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+}
>>+
>>+static void guc_timestamp_ping(struct work_struct *wrk)
>>+{
>>+	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>>+					     timestamp.work.work);
>>+
>>+	__update_guc_busyness_stats(guc);
>
>From ping you may need to ensure you wake up the GPU (not call 
>intel_gt_pm_get_if_awake in update) or I think there is a chance ping 
>gets unlucky and fails to do its job.
>
>Probably get the pm ref here and remove it from 
>__update_guc_busyness_stats, since the other caller (park) guarantees 
>pm ref is still held.

will do

>
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>>+}
>>+
>>+static int guc_action_enable_usage_stats(struct intel_guc *guc)
>>+{
>>+	u32 offset = intel_guc_engine_usage_offset(guc);
>>+	u32 action[] = {
>>+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>>+		offset,
>>+		0,
>>+	};
>>+
>>+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>>+}
>>+
>>+static void guc_init_engine_stats(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	intel_wakeref_t wakeref;
>>+
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>
>Not sure how this slots in with unpark. It will probably be called two 
>times but it also probably does not matter? If you can figure it out 
>perhaps you can remove this call from here. Or maybe there is a 
>separate path where disable-enable can be called without the 
>park-unpark transition. In which case you could call the unpark helper 
>here. Not sure really.

- disable_submission pairs with guc_init_engine_stats for the gt reset 
  path.
- park/unpark just follow the gt_park/gt_unpark paths.

I haven't checked whether reset eventually results in park/unpark or 
whether they are separate paths, though. In the reset path there are a 
bunch of i915_requests in flight, so it is difficult to say whether the 
gt_park/gt_unpark was caused by the reset or by the requests.

The cases where mod_delayed_work is called twice are:

1) module load
2) i915_gem_resume (based on rc6-suspend test)

In both cases, unpark is followed by guc_init_engine_stats. Looking at 
what mod_delayed_work returns, I see that it just modifies the timer if 
the work is already queued/pending, so I think we should be okay.

I don't see cancel getting called twice without a mod_delayed_work in 
between.

Thanks,
Umesh

>
>Regards,
>
>Tvrtko
>
>>+
>>+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>>+		int ret = guc_action_enable_usage_stats(guc);
>>+
>>+		if (ret)
>>+			drm_err(&gt->i915->drm,
>>+				"Failed to enable usage stats: %d!\n", ret);
>>+	}
>>+}
>>+
>>+void intel_guc_busyness_park(struct intel_gt *gt)
>>+{
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+
>>+	cancel_delayed_work(&guc->timestamp.work);
>>+	__update_guc_busyness_stats(guc);
>>+}
>>+
>>+void intel_guc_busyness_unpark(struct intel_gt *gt)
>>+{
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>>+}
>>+
>>  /*
>>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>>   * at firmware loading time.
>>   */
>>  int intel_guc_submission_init(struct intel_guc *guc)
>>  {
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>  	int ret;
>>  	if (guc->lrc_desc_pool)
>>@@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>  	INIT_LIST_HEAD(&guc->guc_id_list);
>>  	ida_init(&guc->guc_ids);
>>+	spin_lock_init(&guc->timestamp.lock);
>>+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>>+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>>+
>>  	return 0;
>>  }
>>@@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_flush = gen12_emit_flush_xcs;
>>  	}
>>  	engine->set_default_submission = guc_set_default_submission;
>>+	engine->busyness = guc_engine_busyness;
>>+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>@@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>  void intel_guc_submission_enable(struct intel_guc *guc)
>>  {
>>  	guc_init_lrc_mapping(guc);
>>+	guc_init_engine_stats(guc);
>>  }
>>  void intel_guc_submission_disable(struct intel_guc *guc)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>index c7ef44fa0c36..5a95a9f0a8e3 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>@@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>  				    struct i915_request *hung_rq,
>>  				    struct drm_printer *m);
>>+void intel_guc_busyness_park(struct intel_gt *gt);
>>+void intel_guc_busyness_unpark(struct intel_gt *gt);
>>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>index a897f4abea0c..9aee08425382 100644
>>--- a/drivers/gpu/drm/i915/i915_reg.h
>>+++ b/drivers/gpu/drm/i915/i915_reg.h
>>@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>>+
>>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>>


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-06 20:45     ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-06 20:45 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter, Matthew Brost

On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:
>
>On 05/10/2021 18:47, Umesh Nerlige Ramappa wrote:
>>With GuC handling scheduling, i915 is not aware of the time that a
>>context is scheduled in and out of the engine. Since i915 pmu relies on
>>this info to provide engine busyness to the user, GuC shares this info
>>with i915 for all engines using shared memory. For each engine, this
>>info contains:
>>
>>- total busyness: total time that the context was running (total)
>>- id: id of the running context (id)
>>- start timestamp: timestamp when the context started running (start)
>>
>>At the time (now) of sampling the engine busyness, if the id is valid
>>(!= ~0), and start is non-zero, then the context is considered to be
>>active and the engine busyness is calculated using the below equation
>>
>>	engine busyness = total + (now - start)
>>
>>All times are obtained from the gt clock base. For inactive contexts,
>>engine busyness is just equal to the total.
>>
>>The start and total values provided by GuC are 32 bits and wrap around
>>in a few minutes. Since perf pmu provides busyness as 64 bit
>>monotonically increasing values, there is a need for this implementation
>>to account for overflows and extend the time to 64 bits before returning
>>busyness to the user. In order to do that, a worker runs periodically at
>>frequency = 1/8th the time it takes for the timestamp to wrap. As an
>>example, that would be once in 27 seconds for a gt clock frequency of
>>19.2 MHz.
>>
>>Opens and wip that are targeted for later patches:
>>
>>1) On global gt reset the total busyness of engines resets and i915
>>    needs to fix that so that user sees monotonically increasing
>>    busyness.
>>2) In runtime suspend mode, the worker may not need to be run. We could
>>    stop the worker on suspend and rerun it on resume provided that the
>>    guc pm timestamp does not tick during suspend.
>
>Second point had now been addressed, right?

Both were addressed, actually. For reset, I was mainly running 
busy-hang, and after adding your suggestion of maintaining a consistent 
view, busy-hang passes too.

I will remove them from the commit msg.

>
>>
>>Note:
>>There might be an overaccounting of busyness due to the fact that GuC
>>may be updating the total and start values while kmd is reading them.
>>(i.e kmd may read the updated total and the stale start). In such a
>>case, user may see higher busyness value followed by smaller ones which
>>would eventually catch up to the higher value.
>>
>>v2: (Tvrtko)
>>- Include details in commit message
>>- Move intel engine busyness function into execlist code
>>- Use union inside engine->stats
>>- Use natural type for ping delay jiffies
>>- Drop active_work condition checks
>>- Use for_each_engine if iterating all engines
>>- Drop seq locking, use spinlock at guc level to update engine stats
>>- Document worker specific details
>>
>>v3: (Tvrtko/Umesh)
>>- Demarcate guc and execlist stat objects with comments
>>- Document known over-accounting issue in commit
>>- Provide a consistent view of guc state
>>- Add hooks to gt park/unpark for guc busyness
>>- Stop/start worker in gt park/unpark path
>>- Drop inline
>>- Move spinlock and worker inits to guc initialization
>>- Drop helpers that are called only once
>>
>>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>  12 files changed, 398 insertions(+), 49 deletions(-)
>>
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>index 2ae57e4656a3..6fcc70a313d9 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>  	intel_engine_print_breadcrumbs(engine, m);
>>  }
>>-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>-					    ktime_t *now)
>>-{
>>-	ktime_t total = engine->stats.total;
>>-
>>-	/*
>>-	 * If the engine is executing something at the moment
>>-	 * add it to the total.
>>-	 */
>>-	*now = ktime_get();
>>-	if (READ_ONCE(engine->stats.active))
>>-		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>-
>>-	return total;
>>-}
>>-
>>  /**
>>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>   * @engine: engine to report on
>>@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>   */
>>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>  {
>>-	unsigned int seq;
>>-	ktime_t total;
>>-
>>-	do {
>>-		seq = read_seqcount_begin(&engine->stats.lock);
>>-		total = __intel_engine_get_busy_time(engine, now);
>>-	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>-
>>-	return total;
>>+	return engine->busyness(engine, now);
>>  }
>>  struct intel_context *
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>index 5ae1207c363b..8e1b9c38a6fc 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>@@ -432,6 +432,12 @@ struct intel_engine_cs {
>>  	void		(*add_active_request)(struct i915_request *rq);
>>  	void		(*remove_active_request)(struct i915_request *rq);
>>+	/*
>>+	 * Get engine busyness and the time at which the busyness was sampled.
>>+	 */
>>+	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>>+				    ktime_t *now);
>>+
>>  	struct intel_engine_execlists execlists;
>>  	/*
>>@@ -481,30 +487,66 @@ struct intel_engine_cs {
>>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>  	struct {
>>-		/**
>>-		 * @active: Number of contexts currently scheduled in.
>>-		 */
>>-		unsigned int active;
>>-
>>-		/**
>>-		 * @lock: Lock protecting the below fields.
>>-		 */
>>-		seqcount_t lock;
>>-
>>-		/**
>>-		 * @total: Total time this engine was busy.
>>-		 *
>>-		 * Accumulated time not counting the most recent block in cases
>>-		 * where engine is currently busy (active > 0).
>>-		 */
>>-		ktime_t total;
>>-
>>-		/**
>>-		 * @start: Timestamp of the last idle to active transition.
>>-		 *
>>-		 * Idle is defined as active == 0, active is active > 0.
>>-		 */
>>-		ktime_t start;
>>+		union {
>>+			/* Fields used by the execlists backend. */
>>+			struct {
>>+				/**
>>+				 * @active: Number of contexts currently
>>+				 * scheduled in.
>>+				 */
>>+				unsigned int active;
>>+
>>+				/**
>>+				 * @lock: Lock protecting the below fields.
>>+				 */
>>+				seqcount_t lock;
>>+
>>+				/**
>>+				 * @total: Total time this engine was busy.
>>+				 *
>>+				 * Accumulated time not counting the most recent
>>+				 * block in cases where engine is currently busy
>>+				 * (active > 0).
>>+				 */
>>+				ktime_t total;
>>+
>>+				/**
>>+				 * @start: Timestamp of the last idle to active
>>+				 * transition.
>>+				 *
>>+				 * Idle is defined as active == 0, active is
>>+				 * active > 0.
>>+				 */
>>+				ktime_t start;
>>+			};
>>+
>>+			/* Fields used by the GuC backend. */
>>+			struct {
>>+				/**
>>+				 * @running: Active state of the engine when
>>+				 * busyness was last sampled.
>>+				 */
>>+				bool running;
>>+
>>+				/**
>>+				 * @prev_total: Previous value of total runtime
>>+				 * clock cycles.
>>+				 */
>>+				u32 prev_total;
>>+
>>+				/**
>>+				 * @total_gt_clks: Total gt clock cycles this
>>+				 * engine was busy.
>>+				 */
>>+				u64 total_gt_clks;
>>+
>>+				/**
>>+				 * @start_gt_clk: GT clock time of last idle to
>>+				 * active transition.
>>+				 */
>>+				u64 start_gt_clk;
>>+			};
>>+		};
>>  		/**
>>  		 * @rps: Utilisation at last RPS sampling.
>>diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>index 7147fe80919e..5c9b695e906c 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>@@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>  	lrc_fini_wa_ctx(engine);
>>  }
>>+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					   ktime_t *now)
>>+{
>>+	ktime_t total = engine->stats.total;
>>+
>>+	/*
>>+	 * If the engine is executing something at the moment
>>+	 * add it to the total.
>>+	 */
>>+	*now = ktime_get();
>>+	if (READ_ONCE(engine->stats.active))
>>+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>+
>>+	return total;
>>+}
>>+
>>+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					 ktime_t *now)
>>+{
>>+	unsigned int seq;
>>+	ktime_t total;
>>+
>>+	do {
>>+		seq = read_seqcount_begin(&engine->stats.lock);
>>+		total = __execlists_engine_busyness(engine, now);
>>+	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>+
>>+	return total;
>>+}
>>+
>>  static void
>>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  {
>>@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_bb_start = gen8_emit_bb_start;
>>  	else
>>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>>+
>>+	engine->busyness = execlists_engine_busyness;
>>  }
>>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>>diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>index 524eaf678790..b4a8594bc46c 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>@@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>  	intel_rc6_unpark(&gt->rc6);
>>  	intel_rps_unpark(&gt->rps);
>>  	i915_pmu_gt_unparked(i915);
>>+	intel_guc_busyness_unpark(gt);
>>  	intel_gt_unpark_requests(gt);
>>  	runtime_begin(gt);
>>@@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>  	runtime_end(gt);
>>  	intel_gt_park_requests(gt);
>>+	intel_guc_busyness_park(gt);
>>  	i915_vma_parked(gt);
>>  	i915_pmu_gt_parked(i915);
>>  	intel_rps_park(&gt->rps);
>>diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>index 8ff582222aff..ff1311d4beff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>@@ -143,6 +143,7 @@ enum intel_guc_action {
>>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>>+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>  	INTEL_GUC_ACTION_LIMIT
>>  };
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>index 5dd174babf7a..22c30dbdf63a 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>@@ -104,6 +104,8 @@ struct intel_guc {
>>  	u32 ads_regset_size;
>>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>  	u32 ads_golden_ctxt_size;
>>+	/** @ads_engine_usage_size: size of engine usage in the ADS */
>>+	u32 ads_engine_usage_size;
>>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>  	struct i915_vma *lrc_desc_pool;
>>@@ -138,6 +140,30 @@ struct intel_guc {
>>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>>  	struct mutex send_mutex;
>>+
>>+	struct {
>>+		/**
>>+		 * @lock: Lock protecting the below fields and the engine stats.
>>+		 */
>>+		spinlock_t lock;
>>+
>>+		/**
>>+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>>+		 */
>>+		u64 gt_stamp;
>>+
>>+		/**
>>+		 * @ping_delay: Period for polling the GT timestamp for
>>+		 * overflow.
>>+		 */
>>+		unsigned long ping_delay;
>>+
>>+		/**
>>+		 * @work: Periodic work to adjust GT timestamp, engine and
>>+		 * context usage for overflows.
>>+		 */
>>+		struct delayed_work work;
>>+	} timestamp;
>>  };
>>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>index 2c6ea64af7ec..ca9ab53999d5 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>@@ -26,6 +26,8 @@
>>   *      | guc_policies                          |
>>   *      +---------------------------------------+
>>   *      | guc_gt_system_info                    |
>>+ *      +---------------------------------------+
>>+ *      | guc_engine_usage                      |
>>   *      +---------------------------------------+ <== static
>>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>>   *      | guc_mmio_reg[countB] (engine 0.1)     |
>>@@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>  	struct guc_ads ads;
>>  	struct guc_policies policies;
>>  	struct guc_gt_system_info system_info;
>>+	struct guc_engine_usage engine_usage;
>>  	/* From here on, location is dynamic! Refer to above diagram. */
>>  	struct guc_mmio_reg regset[0];
>>  } __packed;
>>@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>  	guc_ads_private_data_reset(guc);
>>  }
>>+
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>>+{
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>>+	u32 offset = base + ptr_offset(blob, engine_usage);
>>+
>>+	return offset;
>>+}
>>+
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>>+{
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u8 guc_class = engine_class_to_guc_class(engine->class);
>>+
>>+	return &blob->engine_usage.engines[guc_class][engine->instance];
>>+}
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>index 3d85051d57e4..e74c110facff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>@@ -6,8 +6,11 @@
>>  #ifndef _INTEL_GUC_ADS_H_
>>  #define _INTEL_GUC_ADS_H_
>>+#include <linux/types.h>
>>+
>>  struct intel_guc;
>>  struct drm_printer;
>>+struct intel_engine_cs;
>>  int intel_guc_ads_create(struct intel_guc *guc);
>>  void intel_guc_ads_destroy(struct intel_guc *guc);
>>@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>  void intel_guc_ads_reset(struct intel_guc *guc);
>>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>  				     struct drm_printer *p);
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>  #endif
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>index fa4be13c8854..7c9c081670fc 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>@@ -294,6 +294,19 @@ struct guc_ads {
>>  	u32 reserved[15];
>>  } __packed;
>>+/* Engine usage stats */
>>+struct guc_engine_usage_record {
>>+	u32 current_context_index;
>>+	u32 last_switch_in_stamp;
>>+	u32 reserved0;
>>+	u32 total_runtime;
>>+	u32 reserved1[4];
>>+} __packed;
>>+
>>+struct guc_engine_usage {
>>+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>>+} __packed;
>>+
>>  /* GuC logging structures */
>>  enum guc_log_buffer_type {
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>index ba0de35f6323..3f7d0f2ac9da 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>@@ -12,6 +12,7 @@
>>  #include "gt/intel_engine_pm.h"
>>  #include "gt/intel_engine_heartbeat.h"
>>  #include "gt/intel_gt.h"
>>+#include "gt/intel_gt_clock_utils.h"
>>  #include "gt/intel_gt_irq.h"
>>  #include "gt/intel_gt_pm.h"
>>  #include "gt/intel_gt_requests.h"
>>@@ -20,6 +21,7 @@
>>  #include "gt/intel_mocs.h"
>>  #include "gt/intel_ring.h"
>>+#include "intel_guc_ads.h"
>>  #include "intel_guc_submission.h"
>>  #include "i915_drv.h"
>>@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>  static void disable_submission(struct intel_guc *guc)
>>  {
>>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	enum intel_engine_id id;
>>+	unsigned long flags;
>>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>  		GEM_BUG_ON(!guc->ct.enabled);
>>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>  		sched_engine->tasklet.callback = NULL;
>>  	}
>>+
>>+	cancel_delayed_work(&guc->timestamp.work);
>
>I am not sure when disable_submission gets called so a question - 
>could it be important to call cancel_delayed_work_sync here to ensure 
>if the worker was running it had exited before proceeding?

disable_submission is called in the reset_prepare path for uc resets. I 
see this happening only with the busy-hang test, which does a global gt 
reset. Its counterpart is guc_init_engine_stats, which is called 
post-reset in the GuC initialization path.

I tried cancel_delayed_work_sync both here and in park. It seems to 
work fine, so I will change the calls to the _sync versions.

>
>Also, does this interact with the open about resets? Should/could 
>parking helper be called from here?

It is related to reset. Below, I am only setting the engine prev_total 
to 0, since it gets reset on gt reset. I thought that's all we need to 
keep the busyness increasing monotonically. By calling the parking 
helper, are you suggesting we should update the other stats too (total, 
start, gt_stamp etc.)?

>
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	for_each_engine(engine, gt, id)
>>+		engine->stats.prev_total = 0;
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>  static void enable_submission(struct intel_guc *guc)
>>@@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>  }
>>+/*
>>+ * GuC stores busyness stats for each engine at context in/out boundaries. A
>>+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>>+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>>+ * GuC.
>>+ *
>>+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>>+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
>>+ * active. For an active engine total busyness = total + (now - start), where
>>+ * 'now' is the time at which the busyness is sampled. For inactive engine,
>>+ * total busyness = total.
>>+ *
>>+ * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>>+ *
>>+ * The start and total values provided by GuC are 32 bits and wrap around in a
>>+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>>+ * increasing ns values, there is a need for this implementation to account for
>>+ * overflows and extend the GuC provided values to 64 bits before returning
>>+ * busyness to the user. In order to do that, a worker runs periodically at
>>+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>>+ * 27 seconds for a gt clock frequency of 19.2 MHz).
>>+ */
>>+
>>+#define WRAP_TIME_CLKS U32_MAX
>>+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>>+
>>+static void
>>+__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>>+{
>>+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>>+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>>+
>>+	if (new_start == lower_32_bits(*prev_start))
>>+		return;
>>+
>>+	if (new_start < gt_stamp_last &&
>>+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>>+		gt_stamp_hi++;
>>+
>>+	if (new_start > gt_stamp_last &&
>>+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>>+		gt_stamp_hi--;
>>+
>>+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>>+}
>>+
>>+static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>>+{
>>+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	u32 last_switch = rec->last_switch_in_stamp;
>>+	u32 ctx_id = rec->current_context_index;
>>+	u32 total = rec->total_runtime;
>>+
>>+	lockdep_assert_held(&guc->timestamp.lock);
>>+
>>+	engine->stats.running = ctx_id != ~0U && last_switch;
>>+	if (engine->stats.running)
>>+		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>>+				     last_switch);
>>+
>>+	/*
>>+	 * Instead of adjusting the total for overflow, just add the
>>+	 * difference from previous sample to the stats.total_gt_clks
>>+	 */
>>+	if (total && total != ~0U) {
>>+		engine->stats.total_gt_clks += (u32)(total -
>>+						     engine->stats.prev_total);
>>+		engine->stats.prev_total = total;
>>+	}
>>+}
>>+
>>+static void guc_update_pm_timestamp(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	u32 gt_stamp_now, gt_stamp_hi;
>>+
>>+	lockdep_assert_held(&guc->timestamp.lock);
>>+
>>+	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>>+	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>>+
>>+	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>>+		gt_stamp_hi++;
>>+
>>+	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>>+}
>>+
>>+/*
>>+ * Unlike the execlist mode of submission total and active times are in terms of
>>+ * gt clocks. The *now parameter is retained to return the cpu time at which the
>>+ * busyness was sampled.
>>+ */
>>+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>+{
>>+	struct intel_gt *gt = engine->gt;
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+	unsigned long flags;
>>+	u64 total;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	*now = ktime_get();
>>+
>>+	/*
>>+	 * The active busyness depends on start_gt_clk and gt_stamp.
>>+	 * gt_stamp is updated by i915 only when gt is awake and the
>>+	 * start_gt_clk is derived from GuC state. To get a consistent
>>+	 * view of activity, we query the GuC state only if gt is awake.
>>+	 */
>>+	if (intel_gt_pm_get_if_awake(gt)) {
>>+		guc_update_engine_gt_clks(engine);
>>+		guc_update_pm_timestamp(guc);
>>+		intel_gt_pm_put_async(gt);
>>+	}
>>+
>>+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>>+	if (engine->stats.running) {
>>+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>>+
>>+		total += intel_gt_clock_interval_to_ns(gt, clk);
>>+	}
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+
>>+	return ns_to_ktime(total);
>>+}
>>+
>>+static void __update_guc_busyness_stats(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	enum intel_engine_id id;
>>+	unsigned long flags;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	if (intel_gt_pm_get_if_awake(gt)) {
>>+		guc_update_pm_timestamp(guc);
>>+
>>+		for_each_engine(engine, gt, id)
>>+			guc_update_engine_gt_clks(engine);
>>+
>>+		intel_gt_pm_put_async(gt);
>>+	}
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+}
>>+
>>+static void guc_timestamp_ping(struct work_struct *wrk)
>>+{
>>+	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>>+					     timestamp.work.work);
>>+
>>+	__update_guc_busyness_stats(guc);
>
>From ping you may need to ensure you wake up the GPU (not call 
>intel_gt_pm_get_if_awake in update) or I think there is a chance ping 
>gets unlucky and fails to do its job.
>
>Probably get the pm ref here and remove it from 
>__update_guc_busyness_stats, since the other caller (park) guarantees 
>pm ref is still held.

will do

>
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>>+}
>>+
>>+static int guc_action_enable_usage_stats(struct intel_guc *guc)
>>+{
>>+	u32 offset = intel_guc_engine_usage_offset(guc);
>>+	u32 action[] = {
>>+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>>+		offset,
>>+		0,
>>+	};
>>+
>>+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>>+}
>>+
>>+static void guc_init_engine_stats(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	intel_wakeref_t wakeref;
>>+
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>
>Not sure how this slots in with unpark. It will probably be called two 
>times but it also probably does not matter? If you can figure it out 
>perhaps you can remove this call from here. Or maybe there is a 
>separate path where disable-enable can be called without the 
>park-unpark transition. In which case you could call the unpark helper 
>here. Not sure really.

- disable_submission pairs with guc_init_engine_stats for the gt reset 
  path.
- park/unpark just follow the gt_park/gt_unpark paths.

I haven't checked whether a reset eventually results in park/unpark or 
whether they are separate paths, though. In the reset path there are a 
number of i915_requests in flight, so it is difficult to say whether 
the gt_park/gt_unpark was caused by the reset or by the requests.

The cases where mod_delayed_work is called twice are:

1) module load
2) i915_gem_resume (based on rc6-suspend test)

In both cases, unpark is followed by guc_init_engine_stats. Looking at 
what mod_delayed_work returns, I see that it just updates the expiry 
time if the work is already queued/pending, so I think we should be 
okay.

I don't see cancel getting called twice without a mod_delayed_work in 
between.

Thanks,
Umesh

>
>Regards,
>
>Tvrtko
>
>>+
>>+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>>+		int ret = guc_action_enable_usage_stats(guc);
>>+
>>+		if (ret)
>>+			drm_err(&gt->i915->drm,
>>+				"Failed to enable usage stats: %d!\n", ret);
>>+	}
>>+}
>>+
>>+void intel_guc_busyness_park(struct intel_gt *gt)
>>+{
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+
>>+	cancel_delayed_work(&guc->timestamp.work);
>>+	__update_guc_busyness_stats(guc);
>>+}
>>+
>>+void intel_guc_busyness_unpark(struct intel_gt *gt)
>>+{
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>+			 guc->timestamp.ping_delay);
>>+}
>>+
>>  /*
>>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>>   * at firmware loading time.
>>   */
>>  int intel_guc_submission_init(struct intel_guc *guc)
>>  {
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>  	int ret;
>>  	if (guc->lrc_desc_pool)
>>@@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>  	INIT_LIST_HEAD(&guc->guc_id_list);
>>  	ida_init(&guc->guc_ids);
>>+	spin_lock_init(&guc->timestamp.lock);
>>+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>>+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>>+
>>  	return 0;
>>  }
>>@@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_flush = gen12_emit_flush_xcs;
>>  	}
>>  	engine->set_default_submission = guc_set_default_submission;
>>+	engine->busyness = guc_engine_busyness;
>>+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>@@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>  void intel_guc_submission_enable(struct intel_guc *guc)
>>  {
>>  	guc_init_lrc_mapping(guc);
>>+	guc_init_engine_stats(guc);
>>  }
>>  void intel_guc_submission_disable(struct intel_guc *guc)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>index c7ef44fa0c36..5a95a9f0a8e3 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>>@@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>  				    struct i915_request *hung_rq,
>>  				    struct drm_printer *m);
>>+void intel_guc_busyness_park(struct intel_gt *gt);
>>+void intel_guc_busyness_unpark(struct intel_gt *gt);
>>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>index a897f4abea0c..9aee08425382 100644
>>--- a/drivers/gpu/drm/i915/i915_reg.h
>>+++ b/drivers/gpu/drm/i915/i915_reg.h
>>@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>>+
>>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-06 20:45     ` [Intel-gfx] " Umesh Nerlige Ramappa
@ 2021-10-07  8:17       ` Tvrtko Ursulin
  -1 siblings, 0 replies; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-07  8:17 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter, Matthew Brost


On 06/10/2021 21:45, Umesh Nerlige Ramappa wrote:
> On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:

[snip]

>>> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>>  static void disable_submission(struct intel_guc *guc)
>>>  {
>>>      struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>> +    struct intel_gt *gt = guc_to_gt(guc);
>>> +    struct intel_engine_cs *engine;
>>> +    enum intel_engine_id id;
>>> +    unsigned long flags;
>>>      if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>>          GEM_BUG_ON(!guc->ct.enabled);
>>>          __tasklet_disable_sync_once(&sched_engine->tasklet);
>>>          sched_engine->tasklet.callback = NULL;
>>>      }
>>> +
>>> +    cancel_delayed_work(&guc->timestamp.work);
>>
>> I am not sure when disable_submission gets called so a question - 
>> could it be important to call cancel_delayed_work_sync here to ensure 
>> if the worker was running it had exited before proceeding?
> 
> disable_submission is called in the reset_prepare path for uc resets. I 
> see this happening only with busy-hang test which does a global gt 
> reset. The counterpart for this is the guc_init_engine_stats which is 
> called post reset in the path to initialize GuC.
> 
> I tried cancel_delayed_work_sync both here and in park. Seems to work 
> fine, so will change the calls to _sync versions.

Park is not allowed to sleep, so you can't do sync from there. It might 
have been my question which put you on the wrong path, sorry. Now I think 
the question remains: what happens if the ping worker happens to be 
sampling GuC state while GuC is being reset? Do you need some sort of a 
lock to protect that, or to make sure the worker skips if a reset is in 
progress?

>>
>> Also, does this interact with the open about resets? Should/could 
>> parking helper be called from here?
> 
> It is related to reset. Below, I am only updating the engine prev_total 
> to 0 since it gets reset on gt reset. I thought that's all we need to 
> keep the busyness increasing monotonically. By calling parking helper, 
> are you suggesting we should update the other stats too (total, start, 
> gt_stamp etc.)?

Don't know, was just asking.

Looking at it now again, resetting prev_total looks correct to me if it 
tracks rec->total_runtime which is also reset by GuC. 
Engine->stats.total_gt_clks is then purely software managed state which 
you only keep adding to. Yes looks fine to me.

>>
>>> +    mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>> +             guc->timestamp.ping_delay);
>>> +}
>>> +
>>> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
>>> +{
>>> +    u32 offset = intel_guc_engine_usage_offset(guc);
>>> +    u32 action[] = {
>>> +        INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>>> +        offset,
>>> +        0,
>>> +    };
>>> +
>>> +    return intel_guc_send(guc, action, ARRAY_SIZE(action));
>>> +}
>>> +
>>> +static void guc_init_engine_stats(struct intel_guc *guc)
>>> +{
>>> +    struct intel_gt *gt = guc_to_gt(guc);
>>> +    intel_wakeref_t wakeref;
>>> +
>>> +    mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>> +             guc->timestamp.ping_delay);
>>
>> Not sure how this slots in with unpark. It will probably be called two 
>> times but it also probably does not matter? If you can figure it out 
>> perhaps you can remove this call from here. Or maybe there is a 
>> separate path where disable-enable can be called without the 
>> park-unpark transition. In which case you could call the unpark helper 
>> here. Not sure really.
> 
> - disable_submission pairs with guc_init_engine_stats for the gt reset 
>   path.
> - park/unpark just follow the gt_park/gt_unpark paths.
> 
> I haven't checked if reset eventually results in park/unpark or if they 
> are separate paths though. In the reset path, there are a bunch of 
> i915_requests going on, so difficult to say if reset caused the 
> gt_park/gt_unpark or was it the requests.
> 
> The cases where mod_delayed_work is called twice are:
> 
> 1) module load
> 2) i915_gem_resume (based on rc6-suspend test)
> 
> In both cases, unpark is followed by guc_init_engine_stats. Looking a 
> bit at what is returned from the mod_delayed_work, I see that it just 
> modifies the time if the work is already queued/pending, so I am 
> thinking we should be okay.
> 
> I don't see cancel getting called twice without a mod_delayed_work in 
> between.

Sounds good.

Regards,

Tvrtko


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-07  8:17       ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-10-07 15:42         ` Umesh Nerlige Ramappa
  -1 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-07 15:42 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter, Matthew Brost

On Thu, Oct 07, 2021 at 09:17:34AM +0100, Tvrtko Ursulin wrote:
>
>On 06/10/2021 21:45, Umesh Nerlige Ramappa wrote:
>>On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:
>
>[snip]
>
>>>>@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>>> static void disable_submission(struct intel_guc *guc)
>>>> {
>>>>     struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>>>+    struct intel_gt *gt = guc_to_gt(guc);
>>>>+    struct intel_engine_cs *engine;
>>>>+    enum intel_engine_id id;
>>>>+    unsigned long flags;
>>>>     if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>>>         GEM_BUG_ON(!guc->ct.enabled);
>>>>         __tasklet_disable_sync_once(&sched_engine->tasklet);
>>>>         sched_engine->tasklet.callback = NULL;
>>>>     }
>>>>+
>>>>+    cancel_delayed_work(&guc->timestamp.work);
>>>
>>>I am not sure when disable_submission gets called so a question - 
>>>could it be important to call cancel_delayed_work_sync here to 
>>>ensure if the worker was running it had exited before proceeding?
>>
>>disable_submission is called in the reset_prepare path for uc 
>>resets. I see this happening only with busy-hang test which does a 
>>global gt reset. The counterpart for this is the 
>>guc_init_engine_stats which is called post reset in the path to 
>>initialize GuC.
>>
>>I tried cancel_delayed_work_sync both here and in park. Seems to 
>>work fine, so will change the calls to _sync versions.
>
>From park is not allowed to sleep so can't do sync from there. It 
>might have been my question which put you on a wrong path, sorry. Now 
>I think question remains what happens if the ping worker happens to be 
>sampling GuC state as GuC is being reset? Do you need some sort of a 
>lock to protect that, or make sure worker skips if reset in progress?
>

If the ping ran after the actual gt reset, we should be okay. If it ran 
after we reset prev_total but before the gt reset, then we would report 
bad busyness. At the same time, skipping the ping risks timestamp 
overflow. I am thinking we skip the ping, but update all stats in the 
reset_prepare path; reset_prepare runs with a runtime PM reference held.

On a different note, during reset we also need to fold (now - start) into 
total_gt_clks, because we may lose that information in the next pmu 
query or ping (post reset). Maybe I will store active_clks instead of 
running in the stats to do that.

Thanks,
Umesh


* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-05 23:14   ` [Intel-gfx] " Matthew Brost
@ 2021-10-07 23:00     ` Umesh Nerlige Ramappa
  -1 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-07 23:00 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, john.c.harrison, Tvrtko Ursulin, daniel.vetter

On Tue, Oct 05, 2021 at 04:14:23PM -0700, Matthew Brost wrote:
>On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
>> With GuC handling scheduling, i915 is not aware of the time that a
>> context is scheduled in and out of the engine. Since i915 pmu relies on
>> this info to provide engine busyness to the user, GuC shares this info
>> with i915 for all engines using shared memory. For each engine, this
>> info contains:
>>
>> - total busyness: total time that the context was running (total)
>> - id: id of the running context (id)
>> - start timestamp: timestamp when the context started running (start)
>>
>> At the time (now) of sampling the engine busyness, if the id is valid
>> (!= ~0), and start is non-zero, then the context is considered to be
>> active and the engine busyness is calculated using the below equation
>>
>> 	engine busyness = total + (now - start)
>>
>> All times are obtained from the gt clock base. For inactive contexts,
>> engine busyness is just equal to the total.
>>
>> The start and total values provided by GuC are 32 bits and wrap around
>> in a few minutes. Since perf pmu provides busyness as 64 bit
>> monotonically increasing values, there is a need for this implementation
>> to account for overflows and extend the time to 64 bits before returning
>> busyness to the user. In order to do that, a worker runs periodically at
>> frequency = 1/8th the time it takes for the timestamp to wrap. As an
>> example, that would be once in 27 seconds for a gt clock frequency of
>> 19.2 MHz.
>>
>> Opens and wip that are targeted for later patches:
>>
>> 1) On global gt reset the total busyness of engines resets and i915
>>    needs to fix that so that user sees monotonically increasing
>>    busyness.
>> 2) In runtime suspend mode, the worker may not need to be run. We could
>>    stop the worker on suspend and rerun it on resume provided that the
>>    guc pm timestamp does not tick during suspend.
>>
>> Note:
>> There might be an overaccounting of busyness due to the fact that GuC
>> may be updating the total and start values while kmd is reading them.
>> (i.e kmd may read the updated total and the stale start). In such a
>> case, user may see higher busyness value followed by smaller ones which
>> would eventually catch up to the higher value.
>>
>> v2: (Tvrtko)
>> - Include details in commit message
>> - Move intel engine busyness function into execlist code
>> - Use union inside engine->stats
>> - Use natural type for ping delay jiffies
>> - Drop active_work condition checks
>> - Use for_each_engine if iterating all engines
>> - Drop seq locking, use spinlock at guc level to update engine stats
>> - Document worker specific details
>>
>> v3: (Tvrtko/Umesh)
>> - Demarcate guc and execlist stat objects with comments
>> - Document known over-accounting issue in commit
>> - Provide a consistent view of guc state
>> - Add hooks to gt park/unpark for guc busyness
>> - Stop/start worker in gt park/unpark path
>> - Drop inline
>> - Move spinlock and worker inits to guc initialization
>> - Drop helpers that are called only once
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>  12 files changed, 398 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 2ae57e4656a3..6fcc70a313d9 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>  	intel_engine_print_breadcrumbs(engine, m);
>>  }
>>
>> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>> -					    ktime_t *now)
>> -{
>> -	ktime_t total = engine->stats.total;
>> -
>> -	/*
>> -	 * If the engine is executing something at the moment
>> -	 * add it to the total.
>> -	 */
>> -	*now = ktime_get();
>> -	if (READ_ONCE(engine->stats.active))
>> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> -
>> -	return total;
>> -}
>> -
>>  /**
>>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>   * @engine: engine to report on
>> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>   */
>>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>  {
>> -	unsigned int seq;
>> -	ktime_t total;
>> -
>> -	do {
>> -		seq = read_seqcount_begin(&engine->stats.lock);
>> -		total = __intel_engine_get_busy_time(engine, now);
>> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> -
>> -	return total;
>> +	return engine->busyness(engine, now);
>>  }
>>
>>  struct intel_context *
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> index 5ae1207c363b..8e1b9c38a6fc 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>>  	void		(*add_active_request)(struct i915_request *rq);
>>  	void		(*remove_active_request)(struct i915_request *rq);
>>
>> +	/*
>> +	 * Get engine busyness and the time at which the busyness was sampled.
>> +	 */
>> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>> +				    ktime_t *now);
>> +
>>  	struct intel_engine_execlists execlists;
>>
>>  	/*
>> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>
>>  	struct {
>> -		/**
>> -		 * @active: Number of contexts currently scheduled in.
>> -		 */
>> -		unsigned int active;
>> -
>> -		/**
>> -		 * @lock: Lock protecting the below fields.
>> -		 */
>> -		seqcount_t lock;
>> -
>> -		/**
>> -		 * @total: Total time this engine was busy.
>> -		 *
>> -		 * Accumulated time not counting the most recent block in cases
>> -		 * where engine is currently busy (active > 0).
>> -		 */
>> -		ktime_t total;
>> -
>> -		/**
>> -		 * @start: Timestamp of the last idle to active transition.
>> -		 *
>> -		 * Idle is defined as active == 0, active is active > 0.
>> -		 */
>> -		ktime_t start;
>> +		union {
>> +			/* Fields used by the execlists backend. */
>> +			struct {
>> +				/**
>> +				 * @active: Number of contexts currently
>> +				 * scheduled in.
>> +				 */
>> +				unsigned int active;
>> +
>> +				/**
>> +				 * @lock: Lock protecting the below fields.
>> +				 */
>> +				seqcount_t lock;
>> +
>> +				/**
>> +				 * @total: Total time this engine was busy.
>> +				 *
>> +				 * Accumulated time not counting the most recent
>> +				 * block in cases where engine is currently busy
>> +				 * (active > 0).
>> +				 */
>> +				ktime_t total;
>> +
>> +				/**
>> +				 * @start: Timestamp of the last idle to active
>> +				 * transition.
>> +				 *
>> +				 * Idle is defined as active == 0, active is
>> +				 * active > 0.
>> +				 */
>> +				ktime_t start;
>> +			};
>
>Not anonymous? e.g.
>
>struct {
>	...
>} execlists;
>struct {
>	...
>} guc;
>
>IMO this is better as this is self documenting and if you touch an
>backend specific field in a non-backend specific file it pops out as
>incorrect.

Posted a new revision with the above comment addressed. Other comments 
(vfunc), I will add them in the future series.

Thanks,
Umesh

>
>> +
>> +			/* Fields used by the GuC backend. */
>> +			struct {
>> +				/**
>> +				 * @running: Active state of the engine when
>> +				 * busyness was last sampled.
>> +				 */
>> +				bool running;
>> +
>> +				/**
>> +				 * @prev_total: Previous value of total runtime
>> +				 * clock cycles.
>> +				 */
>> +				u32 prev_total;
>> +
>> +				/**
>> +				 * @total_gt_clks: Total gt clock cycles this
>> +				 * engine was busy.
>> +				 */
>> +				u64 total_gt_clks;
>> +
>> +				/**
>> +				 * @start_gt_clk: GT clock time of last idle to
>> +				 * active transition.
>> +				 */
>> +				u64 start_gt_clk;
>> +			};
>> +		};
>>
>>  		/**
>>  		 * @rps: Utilisation at last RPS sampling.
>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> index 7147fe80919e..5c9b695e906c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>  	lrc_fini_wa_ctx(engine);
>>  }
>>
>> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					   ktime_t *now)
>> +{
>> +	ktime_t total = engine->stats.total;
>> +
>> +	/*
>> +	 * If the engine is executing something at the moment
>> +	 * add it to the total.
>> +	 */
>> +	*now = ktime_get();
>> +	if (READ_ONCE(engine->stats.active))
>> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> +
>> +	return total;
>> +}
>> +
>> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					 ktime_t *now)
>> +{
>> +	unsigned int seq;
>> +	ktime_t total;
>> +
>> +	do {
>> +		seq = read_seqcount_begin(&engine->stats.lock);
>> +		total = __execlists_engine_busyness(engine, now);
>> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> +
>> +	return total;
>> +}
>> +
>>  static void
>>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  {
>> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_bb_start = gen8_emit_bb_start;
>>  	else
>>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>> +
>> +	engine->busyness = execlists_engine_busyness;
>>  }
>>
>>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index 524eaf678790..b4a8594bc46c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>  	intel_rc6_unpark(&gt->rc6);
>>  	intel_rps_unpark(&gt->rps);
>>  	i915_pmu_gt_unparked(i915);
>> +	intel_guc_busyness_unpark(gt);
>
>I personally don't mind this but in the spirit of correct layering, this
>likely should be generic wrapper inline func which calls a vfunc if
>present (e.g. set the vfunc for backend, don't set for execlists).
>
>>
>>  	intel_gt_unpark_requests(gt);
>>  	runtime_begin(gt);
>> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>  	runtime_end(gt);
>>  	intel_gt_park_requests(gt);
>>
>> +	intel_guc_busyness_park(gt);
>
>Same here.
>
>>  	i915_vma_parked(gt);
>>  	i915_pmu_gt_parked(i915);
>>  	intel_rps_park(&gt->rps);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index 8ff582222aff..ff1311d4beff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -143,6 +143,7 @@ enum intel_guc_action {
>>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>  	INTEL_GUC_ACTION_LIMIT
>>  };
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 5dd174babf7a..22c30dbdf63a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -104,6 +104,8 @@ struct intel_guc {
>>  	u32 ads_regset_size;
>>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>  	u32 ads_golden_ctxt_size;
>> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
>> +	u32 ads_engine_usage_size;
>>
>>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>  	struct i915_vma *lrc_desc_pool;
>> @@ -138,6 +140,30 @@ struct intel_guc {
>>
>>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>>  	struct mutex send_mutex;
>> +
>> +	struct {
>> +		/**
>> +		 * @lock: Lock protecting the below fields and the engine stats.
>> +		 */
>> +		spinlock_t lock;
>> +
>
>Again I really don't mind but I'm told not to add more spin locks than
>needed. This really should be protected by a generic GuC submission spin
>lock. e.g. Build on this patch and protect all of this by the
>submission_state.lock.
>
>https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5
>
>Whomevers series gets merged first can include the above patch.
>
>Rest the series looks fine cosmetically to me.
>
>Matt
>
>> +		/**
>> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>> +		 */
>> +		u64 gt_stamp;
>> +
>> +		/**
>> +		 * @ping_delay: Period for polling the GT timestamp for
>> +		 * overflow.
>> +		 */
>> +		unsigned long ping_delay;
>> +
>> +		/**
>> +		 * @work: Periodic work to adjust GT timestamp, engine and
>> +		 * context usage for overflows.
>> +		 */
>> +		struct delayed_work work;
>> +	} timestamp;
>>  };
>>
>>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> index 2c6ea64af7ec..ca9ab53999d5 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> @@ -26,6 +26,8 @@
>>   *      | guc_policies                          |
>>   *      +---------------------------------------+
>>   *      | guc_gt_system_info                    |
>> + *      +---------------------------------------+
>> + *      | guc_engine_usage                      |
>>   *      +---------------------------------------+ <== static
>>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>>   *      | guc_mmio_reg[countB] (engine 0.1)     |
>> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>  	struct guc_ads ads;
>>  	struct guc_policies policies;
>>  	struct guc_gt_system_info system_info;
>> +	struct guc_engine_usage engine_usage;
>>  	/* From here on, location is dynamic! Refer to above diagram. */
>>  	struct guc_mmio_reg regset[0];
>>  } __packed;
>> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>
>>  	guc_ads_private_data_reset(guc);
>>  }
>> +
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>> +{
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>> +	u32 offset = base + ptr_offset(blob, engine_usage);
>> +
>> +	return offset;
>> +}
>> +
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>> +{
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u8 guc_class = engine_class_to_guc_class(engine->class);
>> +
>> +	return &blob->engine_usage.engines[guc_class][engine->instance];
>> +}
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> index 3d85051d57e4..e74c110facff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> @@ -6,8 +6,11 @@
>>  #ifndef _INTEL_GUC_ADS_H_
>>  #define _INTEL_GUC_ADS_H_
>>
>> +#include <linux/types.h>
>> +
>>  struct intel_guc;
>>  struct drm_printer;
>> +struct intel_engine_cs;
>>
>>  int intel_guc_ads_create(struct intel_guc *guc);
>>  void intel_guc_ads_destroy(struct intel_guc *guc);
>> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>  void intel_guc_ads_reset(struct intel_guc *guc);
>>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>  				     struct drm_printer *p);
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>
>>  #endif
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index fa4be13c8854..7c9c081670fc 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -294,6 +294,19 @@ struct guc_ads {
>>  	u32 reserved[15];
>>  } __packed;
>>
>> +/* Engine usage stats */
>> +struct guc_engine_usage_record {
>> +	u32 current_context_index;
>> +	u32 last_switch_in_stamp;
>> +	u32 reserved0;
>> +	u32 total_runtime;
>> +	u32 reserved1[4];
>> +} __packed;
>> +
>> +struct guc_engine_usage {
>> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>> +} __packed;
>> +
>>  /* GuC logging structures */
>>
>>  enum guc_log_buffer_type {
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index ba0de35f6323..3f7d0f2ac9da 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -12,6 +12,7 @@
>>  #include "gt/intel_engine_pm.h"
>>  #include "gt/intel_engine_heartbeat.h"
>>  #include "gt/intel_gt.h"
>> +#include "gt/intel_gt_clock_utils.h"
>>  #include "gt/intel_gt_irq.h"
>>  #include "gt/intel_gt_pm.h"
>>  #include "gt/intel_gt_requests.h"
>> @@ -20,6 +21,7 @@
>>  #include "gt/intel_mocs.h"
>>  #include "gt/intel_ring.h"
>>
>> +#include "intel_guc_ads.h"
>>  #include "intel_guc_submission.h"
>>
>>  #include "i915_drv.h"
>> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>  static void disable_submission(struct intel_guc *guc)
>>  {
>>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>>
>>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>  		GEM_BUG_ON(!guc->ct.enabled);
>>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>  		sched_engine->tasklet.callback = NULL;
>>  	}
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.prev_total = 0;
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>
>>  static void enable_submission(struct intel_guc *guc)
>> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>  }
>>
>> +/*
>> + * GuC stores busyness stats for each engine at context in/out boundaries. A
>> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>> + * GuC.
>> + *
>> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
>> + * active. For an active engine total busyness = total + (now - start), where
>> + * 'now' is the time at which the busyness is sampled. For inactive engine,
>> + * total busyness = total.
>> + *
>> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>> + *
>> + * The start and total values provided by GuC are 32 bits and wrap around in a
>> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>> + * increasing ns values, there is a need for this implementation to account for
>> + * overflows and extend the GuC provided values to 64 bits before returning
>> + * busyness to the user. In order to do that, a worker runs periodically at
>> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>> + * 27 seconds for a gt clock frequency of 19.2 MHz).
>> + */
>> +
>> +#define WRAP_TIME_CLKS U32_MAX
>> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>> +
>> +static void
>> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>> +{
>> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>> +
>> +	if (new_start == lower_32_bits(*prev_start))
>> +		return;
>> +
>> +	if (new_start < gt_stamp_last &&
>> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>> +		gt_stamp_hi++;
>> +
>> +	if (new_start > gt_stamp_last &&
>> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>> +		gt_stamp_hi--;
>> +
>> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>> +}
>> +
>> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>> +{
>> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	u32 last_switch = rec->last_switch_in_stamp;
>> +	u32 ctx_id = rec->current_context_index;
>> +	u32 total = rec->total_runtime;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	engine->stats.running = ctx_id != ~0U && last_switch;
>> +	if (engine->stats.running)
>> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>> +				     last_switch);
>> +
>> +	/*
>> +	 * Instead of adjusting the total for overflow, just add the
>> +	 * difference from previous sample to the stats.total_gt_clks
>> +	 */
>> +	if (total && total != ~0U) {
>> +		engine->stats.total_gt_clks += (u32)(total -
>> +						     engine->stats.prev_total);
>> +		engine->stats.prev_total = total;
>> +	}
>> +}
>> +
>> +static void guc_update_pm_timestamp(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	u32 gt_stamp_now, gt_stamp_hi;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>> +
>> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>> +		gt_stamp_hi++;
>> +
>> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>> +}
>> +
>> +/*
>> + * Unlike the execlist mode of submission total and active times are in terms of
>> + * gt clocks. The *now parameter is retained to return the cpu time at which the
>> + * busyness was sampled.
>> + */
>> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>> +{
>> +	struct intel_gt *gt = engine->gt;
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +	unsigned long flags;
>> +	u64 total;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	*now = ktime_get();
>> +
>> +	/*
>> +	 * The active busyness depends on start_gt_clk and gt_stamp.
>> +	 * gt_stamp is updated by i915 only when gt is awake and the
>> +	 * start_gt_clk is derived from GuC state. To get a consistent
>> +	 * view of activity, we query the GuC state only if gt is awake.
>> +	 */
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_engine_gt_clks(engine);
>> +		guc_update_pm_timestamp(guc);
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>> +	if (engine->stats.running) {
>> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>> +
>> +		total += intel_gt_clock_interval_to_ns(gt, clk);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +
>> +	return ns_to_ktime(total);
>> +}
>> +
>> +static void __update_guc_busyness_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_pm_timestamp(guc);
>> +
>> +		for_each_engine(engine, gt, id)
>> +			guc_update_engine_gt_clks(engine);
>> +
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>> +static void guc_timestamp_ping(struct work_struct *wrk)
>> +{
>> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>> +					     timestamp.work.work);
>> +
>> +	__update_guc_busyness_stats(guc);
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
>> +{
>> +	u32 offset = intel_guc_engine_usage_offset(guc);
>> +	u32 action[] = {
>> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>> +		offset,
>> +		0,
>> +	};
>> +
>> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>> +}
>> +
>> +static void guc_init_engine_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	intel_wakeref_t wakeref;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +
>> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>> +		int ret = guc_action_enable_usage_stats(guc);
>> +
>> +		if (ret)
>> +			drm_err(&gt->i915->drm,
>> +				"Failed to enable usage stats: %d!\n", ret);
>> +	}
>> +}
>> +
>> +void intel_guc_busyness_park(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +	__update_guc_busyness_stats(guc);
>> +}
>> +
>> +void intel_guc_busyness_unpark(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>>  /*
>>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>>   * at firmware loading time.
>>   */
>>  int intel_guc_submission_init(struct intel_guc *guc)
>>  {
>> +	struct intel_gt *gt = guc_to_gt(guc);
>>  	int ret;
>>
>>  	if (guc->lrc_desc_pool)
>> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>  	INIT_LIST_HEAD(&guc->guc_id_list);
>>  	ida_init(&guc->guc_ids);
>>
>> +	spin_lock_init(&guc->timestamp.lock);
>> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>> +
>>  	return 0;
>>  }
>>
>> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_flush = gen12_emit_flush_xcs;
>>  	}
>>  	engine->set_default_submission = guc_set_default_submission;
>> +	engine->busyness = guc_engine_busyness;
>>
>> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>
>> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>  void intel_guc_submission_enable(struct intel_guc *guc)
>>  {
>>  	guc_init_lrc_mapping(guc);
>> +	guc_init_engine_stats(guc);
>>  }
>>
>>  void intel_guc_submission_disable(struct intel_guc *guc)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> index c7ef44fa0c36..5a95a9f0a8e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>  				    struct i915_request *hung_rq,
>>  				    struct drm_printer *m);
>> +void intel_guc_busyness_park(struct intel_gt *gt);
>> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>>
>>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index a897f4abea0c..9aee08425382 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>
>> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>> +
>>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>> --
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread
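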

* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-10-07 23:00     ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-07 23:00 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-gfx, dri-devel, john.c.harrison, Tvrtko Ursulin, daniel.vetter

On Tue, Oct 05, 2021 at 04:14:23PM -0700, Matthew Brost wrote:
>On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:
>> With GuC handling scheduling, i915 is not aware of the time that a
>> context is scheduled in and out of the engine. Since i915 pmu relies on
>> this info to provide engine busyness to the user, GuC shares this info
>> with i915 for all engines using shared memory. For each engine, this
>> info contains:
>>
>> - total busyness: total time that the context was running (total)
>> - id: id of the running context (id)
>> - start timestamp: timestamp when the context started running (start)
>>
>> At the time (now) of sampling the engine busyness, if the id is valid
>> (!= ~0), and start is non-zero, then the context is considered to be
>> active and the engine busyness is calculated using the below equation
>>
>> 	engine busyness = total + (now - start)
>>
>> All times are obtained from the gt clock base. For inactive contexts,
>> engine busyness is just equal to the total.
>>
>> The start and total values provided by GuC are 32 bits and wrap around
>> in a few minutes. Since perf pmu provides busyness as 64 bit
>> monotonically increasing values, there is a need for this implementation
>> to account for overflows and extend the time to 64 bits before returning
>> busyness to the user. In order to do that, a worker runs periodically at
>> frequency = 1/8th the time it takes for the timestamp to wrap. As an
>> example, that would be once in 27 seconds for a gt clock frequency of
>> 19.2 MHz.
>>
>> Opens and wip that are targeted for later patches:
>>
>> 1) On global gt reset the total busyness of engines resets and i915
>>    needs to fix that so that user sees monotonically increasing
>>    busyness.
>> 2) In runtime suspend mode, the worker may not need to be run. We could
>>    stop the worker on suspend and rerun it on resume provided that the
>>    guc pm timestamp does not tick during suspend.
>>
>> Note:
>> There might be an overaccounting of busyness due to the fact that GuC
>> may be updating the total and start values while kmd is reading them.
>> (i.e kmd may read the updated total and the stale start). In such a
>> case, user may see higher busyness value followed by smaller ones which
>> would eventually catch up to the higher value.
>>
>> v2: (Tvrtko)
>> - Include details in commit message
>> - Move intel engine busyness function into execlist code
>> - Use union inside engine->stats
>> - Use natural type for ping delay jiffies
>> - Drop active_work condition checks
>> - Use for_each_engine if iterating all engines
>> - Drop seq locking, use spinlock at guc level to update engine stats
>> - Document worker specific details
>>
>> v3: (Tvrtko/Umesh)
>> - Demarcate guc and execlist stat objects with comments
>> - Document known over-accounting issue in commit
>> - Provide a consistent view of guc state
>> - Add hooks to gt park/unpark for guc busyness
>> - Stop/start worker in gt park/unpark path
>> - Drop inline
>> - Move spinlock and worker inits to guc initialization
>> - Drop helpers that are called only once
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +-
>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +++++--
>>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>  drivers/gpu/drm/i915/gt/intel_gt_pm.c         |   2 +
>>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++++++++++++++++++
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
>>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>  12 files changed, 398 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 2ae57e4656a3..6fcc70a313d9 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>  	intel_engine_print_breadcrumbs(engine, m);
>>  }
>>
>> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>> -					    ktime_t *now)
>> -{
>> -	ktime_t total = engine->stats.total;
>> -
>> -	/*
>> -	 * If the engine is executing something at the moment
>> -	 * add it to the total.
>> -	 */
>> -	*now = ktime_get();
>> -	if (READ_ONCE(engine->stats.active))
>> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> -
>> -	return total;
>> -}
>> -
>>  /**
>>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>   * @engine: engine to report on
>> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>   */
>>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>  {
>> -	unsigned int seq;
>> -	ktime_t total;
>> -
>> -	do {
>> -		seq = read_seqcount_begin(&engine->stats.lock);
>> -		total = __intel_engine_get_busy_time(engine, now);
>> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> -
>> -	return total;
>> +	return engine->busyness(engine, now);
>>  }
>>
>>  struct intel_context *
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> index 5ae1207c363b..8e1b9c38a6fc 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>>  	void		(*add_active_request)(struct i915_request *rq);
>>  	void		(*remove_active_request)(struct i915_request *rq);
>>
>> +	/*
>> +	 * Get engine busyness and the time at which the busyness was sampled.
>> +	 */
>> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>> +				    ktime_t *now);
>> +
>>  	struct intel_engine_execlists execlists;
>>
>>  	/*
>> @@ -481,30 +487,66 @@ struct intel_engine_cs {
>>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>
>>  	struct {
>> -		/**
>> -		 * @active: Number of contexts currently scheduled in.
>> -		 */
>> -		unsigned int active;
>> -
>> -		/**
>> -		 * @lock: Lock protecting the below fields.
>> -		 */
>> -		seqcount_t lock;
>> -
>> -		/**
>> -		 * @total: Total time this engine was busy.
>> -		 *
>> -		 * Accumulated time not counting the most recent block in cases
>> -		 * where engine is currently busy (active > 0).
>> -		 */
>> -		ktime_t total;
>> -
>> -		/**
>> -		 * @start: Timestamp of the last idle to active transition.
>> -		 *
>> -		 * Idle is defined as active == 0, active is active > 0.
>> -		 */
>> -		ktime_t start;
>> +		union {
>> +			/* Fields used by the execlists backend. */
>> +			struct {
>> +				/**
>> +				 * @active: Number of contexts currently
>> +				 * scheduled in.
>> +				 */
>> +				unsigned int active;
>> +
>> +				/**
>> +				 * @lock: Lock protecting the below fields.
>> +				 */
>> +				seqcount_t lock;
>> +
>> +				/**
>> +				 * @total: Total time this engine was busy.
>> +				 *
>> +				 * Accumulated time not counting the most recent
>> +				 * block in cases where engine is currently busy
>> +				 * (active > 0).
>> +				 */
>> +				ktime_t total;
>> +
>> +				/**
>> +				 * @start: Timestamp of the last idle to active
>> +				 * transition.
>> +				 *
>> +				 * Idle is defined as active == 0, active is
>> +				 * active > 0.
>> +				 */
>> +				ktime_t start;
>> +			};
>
>Not anonymous? e.g.
>
>struct {
>	...
>} execlists;
>struct {
>	...
>} guc;
>
>IMO this is better as this is self-documenting and if you touch a
>backend-specific field in a non-backend-specific file it pops out as
>incorrect.

Posted a new revision with the above comment addressed. I will address the 
other comments (vfunc) in a future series.

Thanks,
Umesh

>
>> +
>> +			/* Fields used by the GuC backend. */
>> +			struct {
>> +				/**
>> +				 * @running: Active state of the engine when
>> +				 * busyness was last sampled.
>> +				 */
>> +				bool running;
>> +
>> +				/**
>> +				 * @prev_total: Previous value of total runtime
>> +				 * clock cycles.
>> +				 */
>> +				u32 prev_total;
>> +
>> +				/**
>> +				 * @total_gt_clks: Total gt clock cycles this
>> +				 * engine was busy.
>> +				 */
>> +				u64 total_gt_clks;
>> +
>> +				/**
>> +				 * @start_gt_clk: GT clock time of last idle to
>> +				 * active transition.
>> +				 */
>> +				u64 start_gt_clk;
>> +			};
>> +		};
>>
>>  		/**
>>  		 * @rps: Utilisation at last RPS sampling.
>> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> index 7147fe80919e..5c9b695e906c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>  	lrc_fini_wa_ctx(engine);
>>  }
>>
>> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					   ktime_t *now)
>> +{
>> +	ktime_t total = engine->stats.total;
>> +
>> +	/*
>> +	 * If the engine is executing something at the moment
>> +	 * add it to the total.
>> +	 */
>> +	*now = ktime_get();
>> +	if (READ_ONCE(engine->stats.active))
>> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>> +
>> +	return total;
>> +}
>> +
>> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>> +					 ktime_t *now)
>> +{
>> +	unsigned int seq;
>> +	ktime_t total;
>> +
>> +	do {
>> +		seq = read_seqcount_begin(&engine->stats.lock);
>> +		total = __execlists_engine_busyness(engine, now);
>> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
>> +
>> +	return total;
>> +}
>> +
>>  static void
>>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  {
>> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_bb_start = gen8_emit_bb_start;
>>  	else
>>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>> +
>> +	engine->busyness = execlists_engine_busyness;
>>  }
>>
>>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index 524eaf678790..b4a8594bc46c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -86,6 +86,7 @@ static int __gt_unpark(struct intel_wakeref *wf)
>>  	intel_rc6_unpark(&gt->rc6);
>>  	intel_rps_unpark(&gt->rps);
>>  	i915_pmu_gt_unparked(i915);
>> +	intel_guc_busyness_unpark(gt);
>
>I personally don't mind this but in the spirit of correct layering, this
>likely should be a generic inline wrapper function which calls a vfunc if
>present (e.g. set the vfunc for the backend, don't set it for execlists).
>
>>
>>  	intel_gt_unpark_requests(gt);
>>  	runtime_begin(gt);
>> @@ -104,6 +105,7 @@ static int __gt_park(struct intel_wakeref *wf)
>>  	runtime_end(gt);
>>  	intel_gt_park_requests(gt);
>>
>> +	intel_guc_busyness_park(gt);
>
>Same here.
>
>>  	i915_vma_parked(gt);
>>  	i915_pmu_gt_parked(i915);
>>  	intel_rps_park(&gt->rps);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> index 8ff582222aff..ff1311d4beff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>> @@ -143,6 +143,7 @@ enum intel_guc_action {
>>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>  	INTEL_GUC_ACTION_LIMIT
>>  };
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> index 5dd174babf7a..22c30dbdf63a 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>> @@ -104,6 +104,8 @@ struct intel_guc {
>>  	u32 ads_regset_size;
>>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>  	u32 ads_golden_ctxt_size;
>> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
>> +	u32 ads_engine_usage_size;
>>
>>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>  	struct i915_vma *lrc_desc_pool;
>> @@ -138,6 +140,30 @@ struct intel_guc {
>>
>>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>>  	struct mutex send_mutex;
>> +
>> +	struct {
>> +		/**
>> +		 * @lock: Lock protecting the below fields and the engine stats.
>> +		 */
>> +		spinlock_t lock;
>> +
>
>Again I really don't mind but I'm told not to add more spin locks than
>needed. This really should be protected by a generic GuC submission spin
>lock. e.g. Build on this patch and protect all of this by the
>submission_state.lock.
>
>https://patchwork.freedesktop.org/patch/457310/?series=92789&rev=5
>
>Whoever's series gets merged first can include the above patch.
>
>Rest of the series looks fine cosmetically to me.
>
>Matt
>
>> +		/**
>> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>> +		 */
>> +		u64 gt_stamp;
>> +
>> +		/**
>> +		 * @ping_delay: Period for polling the GT timestamp for
>> +		 * overflow.
>> +		 */
>> +		unsigned long ping_delay;
>> +
>> +		/**
>> +		 * @work: Periodic work to adjust GT timestamp, engine and
>> +		 * context usage for overflows.
>> +		 */
>> +		struct delayed_work work;
>> +	} timestamp;
>>  };
>>
>>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> index 2c6ea64af7ec..ca9ab53999d5 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>> @@ -26,6 +26,8 @@
>>   *      | guc_policies                          |
>>   *      +---------------------------------------+
>>   *      | guc_gt_system_info                    |
>> + *      +---------------------------------------+
>> + *      | guc_engine_usage                      |
>>   *      +---------------------------------------+ <== static
>>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>>   *      | guc_mmio_reg[countB] (engine 0.1)     |
>> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>  	struct guc_ads ads;
>>  	struct guc_policies policies;
>>  	struct guc_gt_system_info system_info;
>> +	struct guc_engine_usage engine_usage;
>>  	/* From here on, location is dynamic! Refer to above diagram. */
>>  	struct guc_mmio_reg regset[0];
>>  } __packed;
>> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>
>>  	guc_ads_private_data_reset(guc);
>>  }
>> +
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>> +{
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>> +	u32 offset = base + ptr_offset(blob, engine_usage);
>> +
>> +	return offset;
>> +}
>> +
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>> +{
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	struct __guc_ads_blob *blob = guc->ads_blob;
>> +	u8 guc_class = engine_class_to_guc_class(engine->class);
>> +
>> +	return &blob->engine_usage.engines[guc_class][engine->instance];
>> +}
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> index 3d85051d57e4..e74c110facff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>> @@ -6,8 +6,11 @@
>>  #ifndef _INTEL_GUC_ADS_H_
>>  #define _INTEL_GUC_ADS_H_
>>
>> +#include <linux/types.h>
>> +
>>  struct intel_guc;
>>  struct drm_printer;
>> +struct intel_engine_cs;
>>
>>  int intel_guc_ads_create(struct intel_guc *guc);
>>  void intel_guc_ads_destroy(struct intel_guc *guc);
>> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>  void intel_guc_ads_reset(struct intel_guc *guc);
>>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>  				     struct drm_printer *p);
>> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>
>>  #endif
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index fa4be13c8854..7c9c081670fc 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -294,6 +294,19 @@ struct guc_ads {
>>  	u32 reserved[15];
>>  } __packed;
>>
>> +/* Engine usage stats */
>> +struct guc_engine_usage_record {
>> +	u32 current_context_index;
>> +	u32 last_switch_in_stamp;
>> +	u32 reserved0;
>> +	u32 total_runtime;
>> +	u32 reserved1[4];
>> +} __packed;
>> +
>> +struct guc_engine_usage {
>> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>> +} __packed;
>> +
>>  /* GuC logging structures */
>>
>>  enum guc_log_buffer_type {
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index ba0de35f6323..3f7d0f2ac9da 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -12,6 +12,7 @@
>>  #include "gt/intel_engine_pm.h"
>>  #include "gt/intel_engine_heartbeat.h"
>>  #include "gt/intel_gt.h"
>> +#include "gt/intel_gt_clock_utils.h"
>>  #include "gt/intel_gt_irq.h"
>>  #include "gt/intel_gt_pm.h"
>>  #include "gt/intel_gt_requests.h"
>> @@ -20,6 +21,7 @@
>>  #include "gt/intel_mocs.h"
>>  #include "gt/intel_ring.h"
>>
>> +#include "intel_guc_ads.h"
>>  #include "intel_guc_submission.h"
>>
>>  #include "i915_drv.h"
>> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>  static void disable_submission(struct intel_guc *guc)
>>  {
>>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>>
>>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>  		GEM_BUG_ON(!guc->ct.enabled);
>>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>  		sched_engine->tasklet.callback = NULL;
>>  	}
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.prev_total = 0;
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>
>>  static void enable_submission(struct intel_guc *guc)
>> @@ -1126,12 +1141,217 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc)
>>  	intel_gt_unpark_heartbeats(guc_to_gt(guc));
>>  }
>>
>> +/*
>> + * GuC stores busyness stats for each engine at context in/out boundaries. A
>> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>> + * GuC.
>> + *
>> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
>> + * active. For an active engine total busyness = total + (now - start), where
>> + * 'now' is the time at which the busyness is sampled. For inactive engine,
>> + * total busyness = total.
>> + *
>> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>> + *
>> + * The start and total values provided by GuC are 32 bits and wrap around in a
>> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>> + * increasing ns values, there is a need for this implementation to account for
>> + * overflows and extend the GuC provided values to 64 bits before returning
>> + * busyness to the user. In order to do that, a worker runs periodically at
>> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>> + * 27 seconds for a gt clock frequency of 19.2 MHz).
>> + */
>> +
>> +#define WRAP_TIME_CLKS U32_MAX
>> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>> +
>> +static void
>> +__extend_last_switch(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>> +{
>> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>> +
>> +	if (new_start == lower_32_bits(*prev_start))
>> +		return;
>> +
>> +	if (new_start < gt_stamp_last &&
>> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>> +		gt_stamp_hi++;
>> +
>> +	if (new_start > gt_stamp_last &&
>> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS && gt_stamp_hi)
>> +		gt_stamp_hi--;
>> +
>> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>> +}
>> +
>> +static void guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>> +{
>> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>> +	struct intel_guc *guc = &engine->gt->uc.guc;
>> +	u32 last_switch = rec->last_switch_in_stamp;
>> +	u32 ctx_id = rec->current_context_index;
>> +	u32 total = rec->total_runtime;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	engine->stats.running = ctx_id != ~0U && last_switch;
>> +	if (engine->stats.running)
>> +		__extend_last_switch(guc, &engine->stats.start_gt_clk,
>> +				     last_switch);
>> +
>> +	/*
>> +	 * Instead of adjusting the total for overflow, just add the
>> +	 * difference from previous sample to the stats.total_gt_clks
>> +	 */
>> +	if (total && total != ~0U) {
>> +		engine->stats.total_gt_clks += (u32)(total -
>> +						     engine->stats.prev_total);
>> +		engine->stats.prev_total = total;
>> +	}
>> +}
>> +
>> +static void guc_update_pm_timestamp(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	u32 gt_stamp_now, gt_stamp_hi;
>> +
>> +	lockdep_assert_held(&guc->timestamp.lock);
>> +
>> +	gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>> +	gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>> +
>> +	if (gt_stamp_now < lower_32_bits(guc->timestamp.gt_stamp))
>> +		gt_stamp_hi++;
>> +
>> +	guc->timestamp.gt_stamp = ((u64) gt_stamp_hi << 32) | gt_stamp_now;
>> +}
>> +
>> +/*
>> + * Unlike the execlist mode of submission, total and active times are in terms of
>> + * gt clocks. The *now parameter is retained to return the cpu time at which the
>> + * busyness was sampled.
>> + */
>> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>> +{
>> +	struct intel_gt *gt = engine->gt;
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +	unsigned long flags;
>> +	u64 total;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	*now = ktime_get();
>> +
>> +	/*
>> +	 * The active busyness depends on start_gt_clk and gt_stamp.
>> +	 * gt_stamp is updated by i915 only when gt is awake and the
>> +	 * start_gt_clk is derived from GuC state. To get a consistent
>> +	 * view of activity, we query the GuC state only if gt is awake.
>> +	 */
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_engine_gt_clks(engine);
>> +		guc_update_pm_timestamp(guc);
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>> +	if (engine->stats.running) {
>> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>> +
>> +		total += intel_gt_clock_interval_to_ns(gt, clk);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +
>> +	return ns_to_ktime(total);
>> +}
>> +
>> +static void __update_guc_busyness_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +
>> +	if (intel_gt_pm_get_if_awake(gt)) {
>> +		guc_update_pm_timestamp(guc);
>> +
>> +		for_each_engine(engine, gt, id)
>> +			guc_update_engine_gt_clks(engine);
>> +
>> +		intel_gt_pm_put_async(gt);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>> +static void guc_timestamp_ping(struct work_struct *wrk)
>> +{
>> +	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>> +					     timestamp.work.work);
>> +
>> +	__update_guc_busyness_stats(guc);
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
>> +{
>> +	u32 offset = intel_guc_engine_usage_offset(guc);
>> +	u32 action[] = {
>> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>> +		offset,
>> +		0,
>> +	};
>> +
>> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>> +}
>> +
>> +static void guc_init_engine_stats(struct intel_guc *guc)
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	intel_wakeref_t wakeref;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +
>> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>> +		int ret = guc_action_enable_usage_stats(guc);
>> +
>> +		if (ret)
>> +			drm_err(&gt->i915->drm,
>> +				"Failed to enable usage stats: %d!\n", ret);
>> +	}
>> +}
>> +
>> +void intel_guc_busyness_park(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	cancel_delayed_work(&guc->timestamp.work);
>> +	__update_guc_busyness_stats(guc);
>> +}
>> +
>> +void intel_guc_busyness_unpark(struct intel_gt *gt)
>> +{
>> +	struct intel_guc *guc = &gt->uc.guc;
>> +
>> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>> +			 guc->timestamp.ping_delay);
>> +}
>> +
>>  /*
>>   * Set up the memory resources to be shared with the GuC (via the GGTT)
>>   * at firmware loading time.
>>   */
>>  int intel_guc_submission_init(struct intel_guc *guc)
>>  {
>> +	struct intel_gt *gt = guc_to_gt(guc);
>>  	int ret;
>>
>>  	if (guc->lrc_desc_pool)
>> @@ -1152,6 +1372,10 @@ int intel_guc_submission_init(struct intel_guc *guc)
>>  	INIT_LIST_HEAD(&guc->guc_id_list);
>>  	ida_init(&guc->guc_ids);
>>
>> +	spin_lock_init(&guc->timestamp.lock);
>> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>> +
>>  	return 0;
>>  }
>>
>> @@ -2606,7 +2830,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_flush = gen12_emit_flush_xcs;
>>  	}
>>  	engine->set_default_submission = guc_set_default_submission;
>> +	engine->busyness = guc_engine_busyness;
>>
>> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>
>> @@ -2705,6 +2931,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>  void intel_guc_submission_enable(struct intel_guc *guc)
>>  {
>>  	guc_init_lrc_mapping(guc);
>> +	guc_init_engine_stats(guc);
>>  }
>>
>>  void intel_guc_submission_disable(struct intel_guc *guc)
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> index c7ef44fa0c36..5a95a9f0a8e3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
>> @@ -28,6 +28,8 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
>>  void intel_guc_dump_active_requests(struct intel_engine_cs *engine,
>>  				    struct i915_request *hung_rq,
>>  				    struct drm_printer *m);
>> +void intel_guc_busyness_park(struct intel_gt *gt);
>> +void intel_guc_busyness_unpark(struct intel_gt *gt);
>>
>>  bool intel_guc_virtual_engine_has_heartbeat(const struct intel_engine_cs *ve);
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index a897f4abea0c..9aee08425382 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>
>> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>> +
>>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>> --
>> 2.20.1
>>

* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-10-04 15:21 ` Tvrtko Ursulin
@ 2021-10-05 18:03   ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-10-05 18:03 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, john.c.harrison, daniel.vetter

On Mon, Oct 04, 2021 at 04:21:44PM +0100, Tvrtko Ursulin wrote:
>
>On 24/09/2021 23:34, Umesh Nerlige Ramappa wrote:
>>With GuC handling scheduling, i915 is not aware of the time that a
>>context is scheduled in and out of the engine. Since i915 pmu relies on
>>this info to provide engine busyness to the user, GuC shares this info
>>with i915 for all engines using shared memory. For each engine, this
>>info contains:
>>
>>- total busyness: total time that the context was running (total)
>>- id: id of the running context (id)
>>- start timestamp: timestamp when the context started running (start)
>>
>>At the time (now) of sampling the engine busyness, if the id is valid
>>(!= ~0), and start is non-zero, then the context is considered to be
>>active and the engine busyness is calculated using the below equation
>>
>>	engine busyness = total + (now - start)
>>
>>All times are obtained from the gt clock base. For inactive contexts,
>>engine busyness is just equal to the total.
>>
>>The start and total values provided by GuC are 32 bits and wrap around
>>in a few minutes. Since perf pmu provides busyness as 64 bit
>>monotonically increasing values, there is a need for this implementation
>>to account for overflows and extend the time to 64 bits before returning
>>busyness to the user. In order to do that, a worker runs periodically at
>>frequqncy = 1/8th the time it takes for the timestamp to wrap. As an
>
>frequency
>
>>example, that would be once in 27 seconds for a gt clock frequency of
>>19.2 MHz.
>>
>>Opens and wip that are targeted for later patches:
>>
>>1) On global gt reset the total busyness of engines resets and i915
>>    needs to fix that so that user sees monotonically increasing
>>    busyness.
>>2) In runtime suspend mode, the worker may not need to be run. We could
>>    stop the worker on suspend and rerun it on resume provided that the
>>    guc pm timestamp does not tick during suspend.
>
>2) sounds easy since there are park/unpark hooks for pmu already. Will 
>see if I can figure out why you did not just immediately do it.

I posted a new revision now with all these comments for your review.

For (2), something was throwing a warning when I tried this earlier. I 
figured I need to move the initialization of the work and spinlock 
elsewhere.

>
>I would also document in the commit message the known problem of 
>possible over-accounting, just for historical reference.

I added a note. If that's not the issue you are mentioning w.r.t.  
engine busyness, let me know.  

>
>>
>>v2: (Tvrtko)
>>- Include details in commit message
>>- Move intel engine busyness function into execlist code
>>- Use union inside engine->stats
>>- Use natural type for ping delay jiffies
>>- Drop active_work condition checks
>>- Use for_each_engine if iterating all engines
>>- Drop seq locking, use spinlock at guc level to update engine stats
>>- Document worker specific details
>>
>>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>---
>>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +--
>>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |  82 ++++---
>>  .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>>  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 +++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 ++
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 204 ++++++++++++++++++
>>  drivers/gpu/drm/i915/i915_reg.h               |   2 +
>>  10 files changed, 363 insertions(+), 49 deletions(-)
>>
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>index 2ae57e4656a3..6fcc70a313d9 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>>  	intel_engine_print_breadcrumbs(engine, m);
>>  }
>>-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>-					    ktime_t *now)
>>-{
>>-	ktime_t total = engine->stats.total;
>>-
>>-	/*
>>-	 * If the engine is executing something at the moment
>>-	 * add it to the total.
>>-	 */
>>-	*now = ktime_get();
>>-	if (READ_ONCE(engine->stats.active))
>>-		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>-
>>-	return total;
>>-}
>>-
>>  /**
>>   * intel_engine_get_busy_time() - Return current accumulated engine busyness
>>   * @engine: engine to report on
>>@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>>   */
>>  ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>>  {
>>-	unsigned int seq;
>>-	ktime_t total;
>>-
>>-	do {
>>-		seq = read_seqcount_begin(&engine->stats.lock);
>>-		total = __intel_engine_get_busy_time(engine, now);
>>-	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>-
>>-	return total;
>>+	return engine->busyness(engine, now);
>>  }
>>  struct intel_context *
>>diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>index 5ae1207c363b..490166b54ed6 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>@@ -432,6 +432,12 @@ struct intel_engine_cs {
>>  	void		(*add_active_request)(struct i915_request *rq);
>>  	void		(*remove_active_request)(struct i915_request *rq);
>>+	/*
>>+	 * Get engine busyness and the time at which the busyness was sampled.
>>+	 */
>>+	ktime_t		(*busyness)(struct intel_engine_cs *engine,
>>+				    ktime_t *now);
>>+
>>  	struct intel_engine_execlists execlists;
>>  	/*
>>@@ -481,30 +487,58 @@ struct intel_engine_cs {
>>  	u32 (*get_cmd_length_mask)(u32 cmd_header);
>>  	struct {
>>-		/**
>>-		 * @active: Number of contexts currently scheduled in.
>>-		 */
>>-		unsigned int active;
>>-
>>-		/**
>>-		 * @lock: Lock protecting the below fields.
>>-		 */
>>-		seqcount_t lock;
>>-
>>-		/**
>>-		 * @total: Total time this engine was busy.
>>-		 *
>>-		 * Accumulated time not counting the most recent block in cases
>>-		 * where engine is currently busy (active > 0).
>>-		 */
>>-		ktime_t total;
>>-
>>-		/**
>>-		 * @start: Timestamp of the last idle to active transition.
>>-		 *
>>-		 * Idle is defined as active == 0, active is active > 0.
>>-		 */
>>-		ktime_t start;
>>+		union {
>
>Maybe put a marker like:
>
>			/* Fields used by the execlists backend. */
>
>>+			struct {
>>+				/**
>>+				 * @active: Number of contexts currently
>>+				 * scheduled in.
>>+				 */
>>+				unsigned int active;
>>+
>>+				/**
>>+				 * @lock: Lock protecting the below fields.
>>+				 */
>>+				seqcount_t lock;
>>+
>>+				/**
>>+				 * @total: Total time this engine was busy.
>>+				 *
>>+				 * Accumulated time not counting the most recent
>>+				 * block in cases where engine is currently busy
>>+				 * (active > 0).
>>+				 */
>>+				ktime_t total;
>>+
>>+				/**
>>+				 * @start: Timestamp of the last idle to active
>>+				 * transition.
>>+				 *
>>+				 * Idle is defined as active == 0, active is
>>+				 * active > 0.
>>+				 */
>>+				ktime_t start;
>>+			};
>>+
>
>			/* Fields used by the GuC backend. */
>
>>+			struct {
>>+				/**
>>+				 * @prev_total: Previous value of total runtime
>>+				 * clock cycles.
>>+				 */
>>+				u32 prev_total;
>>+
>>+				/**
>>+				 * @total_gt_clks: Total gt clock cycles this
>>+				 * engine was busy.
>>+				 */
>>+				u64 total_gt_clks;
>>+
>>+				/**
>>+				 * @start_gt_clk: GT clock time of last idle to
>>+				 * active transition.
>>+				 */
>>+				u64 start_gt_clk;
>>+			};
>>+		};
>>  		/**
>>  		 * @rps: Utilisation at last RPS sampling.
>>diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>index 7147fe80919e..5c9b695e906c 100644
>>--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
>>@@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>>  	lrc_fini_wa_ctx(engine);
>>  }
>>+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					   ktime_t *now)
>>+{
>>+	ktime_t total = engine->stats.total;
>>+
>>+	/*
>>+	 * If the engine is executing something at the moment
>>+	 * add it to the total.
>>+	 */
>>+	*now = ktime_get();
>>+	if (READ_ONCE(engine->stats.active))
>>+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
>>+
>>+	return total;
>>+}
>>+
>>+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
>>+					 ktime_t *now)
>>+{
>>+	unsigned int seq;
>>+	ktime_t total;
>>+
>>+	do {
>>+		seq = read_seqcount_begin(&engine->stats.lock);
>>+		total = __execlists_engine_busyness(engine, now);
>>+	} while (read_seqcount_retry(&engine->stats.lock, seq));
>>+
>>+	return total;
>>+}
>>+
>>  static void
>>  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  {
>>@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_bb_start = gen8_emit_bb_start;
>>  	else
>>  		engine->emit_bb_start = gen8_emit_bb_start_noarb;
>>+
>>+	engine->busyness = execlists_engine_busyness;
>>  }
>>  static void logical_ring_default_irqs(struct intel_engine_cs *engine)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>index 8ff582222aff..ff1311d4beff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
>>@@ -143,6 +143,7 @@ enum intel_guc_action {
>>  	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>>  	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>>  	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
>>+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>>  	INTEL_GUC_ACTION_LIMIT
>>  };
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>index 5dd174babf7a..22c30dbdf63a 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
>>@@ -104,6 +104,8 @@ struct intel_guc {
>>  	u32 ads_regset_size;
>>  	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>>  	u32 ads_golden_ctxt_size;
>>+	/** @ads_engine_usage_size: size of engine usage in the ADS */
>>+	u32 ads_engine_usage_size;
>>  	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>>  	struct i915_vma *lrc_desc_pool;
>>@@ -138,6 +140,30 @@ struct intel_guc {
>>  	/** @send_mutex: used to serialize the intel_guc_send actions */
>>  	struct mutex send_mutex;
>>+
>>+	struct {
>>+		/**
>>+		 * @lock: Lock protecting the below fields and the engine stats.
>>+		 */
>>+		spinlock_t lock;
>>+
>>+		/**
>>+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
>>+		 */
>>+		u64 gt_stamp;
>>+
>>+		/**
>>+		 * @ping_delay: Period for polling the GT timestamp for
>>+		 * overflow.
>>+		 */
>>+		unsigned long ping_delay;
>>+
>>+		/**
>>+		 * @work: Periodic work to adjust GT timestamp, engine and
>>+		 * context usage for overflows.
>>+		 */
>>+		struct delayed_work work;
>>+	} timestamp;
>>  };
>>  static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>index 2c6ea64af7ec..ca9ab53999d5 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
>>@@ -26,6 +26,8 @@
>>   *      | guc_policies                          |
>>   *      +---------------------------------------+
>>   *      | guc_gt_system_info                    |
>>+ *      +---------------------------------------+
>>+ *      | guc_engine_usage                      |
>>   *      +---------------------------------------+ <== static
>>   *      | guc_mmio_reg[countA] (engine 0.0)     |
>>   *      | guc_mmio_reg[countB] (engine 0.1)     |
>>@@ -47,6 +49,7 @@ struct __guc_ads_blob {
>>  	struct guc_ads ads;
>>  	struct guc_policies policies;
>>  	struct guc_gt_system_info system_info;
>>+	struct guc_engine_usage engine_usage;
>>  	/* From here on, location is dynamic! Refer to above diagram. */
>>  	struct guc_mmio_reg regset[0];
>>  } __packed;
>>@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>>  	guc_ads_private_data_reset(guc);
>>  }
>>+
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
>>+{
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
>>+	u32 offset = base + ptr_offset(blob, engine_usage);
>>+
>>+	return offset;
>>+}
>>+
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
>>+{
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	struct __guc_ads_blob *blob = guc->ads_blob;
>>+	u8 guc_class = engine_class_to_guc_class(engine->class);
>>+
>>+	return &blob->engine_usage.engines[guc_class][engine->instance];
>>+}
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>index 3d85051d57e4..e74c110facff 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
>>@@ -6,8 +6,11 @@
>>  #ifndef _INTEL_GUC_ADS_H_
>>  #define _INTEL_GUC_ADS_H_
>>+#include <linux/types.h>
>>+
>>  struct intel_guc;
>>  struct drm_printer;
>>+struct intel_engine_cs;
>>  int intel_guc_ads_create(struct intel_guc *guc);
>>  void intel_guc_ads_destroy(struct intel_guc *guc);
>>@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>>  void intel_guc_ads_reset(struct intel_guc *guc);
>>  void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>>  				     struct drm_printer *p);
>>+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
>>+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>>  #endif
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>index fa4be13c8854..7c9c081670fc 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>@@ -294,6 +294,19 @@ struct guc_ads {
>>  	u32 reserved[15];
>>  } __packed;
>>+/* Engine usage stats */
>>+struct guc_engine_usage_record {
>>+	u32 current_context_index;
>>+	u32 last_switch_in_stamp;
>>+	u32 reserved0;
>>+	u32 total_runtime;
>>+	u32 reserved1[4];
>>+} __packed;
>>+
>>+struct guc_engine_usage {
>>+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
>>+} __packed;
>>+
>>  /* GuC logging structures */
>>  enum guc_log_buffer_type {
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>index ba0de35f6323..5d29a4913e17 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>@@ -12,6 +12,7 @@
>>  #include "gt/intel_engine_pm.h"
>>  #include "gt/intel_engine_heartbeat.h"
>>  #include "gt/intel_gt.h"
>>+#include "gt/intel_gt_clock_utils.h"
>>  #include "gt/intel_gt_irq.h"
>>  #include "gt/intel_gt_pm.h"
>>  #include "gt/intel_gt_requests.h"
>>@@ -20,6 +21,7 @@
>>  #include "gt/intel_mocs.h"
>>  #include "gt/intel_ring.h"
>>+#include "intel_guc_ads.h"
>>  #include "intel_guc_submission.h"
>>  #include "i915_drv.h"
>>@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>>  static void disable_submission(struct intel_guc *guc)
>>  {
>>  	struct i915_sched_engine * const sched_engine = guc->sched_engine;
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	enum intel_engine_id id;
>>+	unsigned long flags;
>>  	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>>  		GEM_BUG_ON(!guc->ct.enabled);
>>  		__tasklet_disable_sync_once(&sched_engine->tasklet);
>>  		sched_engine->tasklet.callback = NULL;
>>  	}
>>+
>>+	cancel_delayed_work(&guc->timestamp.work);
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	for_each_engine(engine, gt, id)
>>+		engine->stats.prev_total = 0;
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>  static void enable_submission(struct intel_guc *guc)
>>@@ -1164,6 +1179,192 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>>  	i915_sched_engine_put(guc->sched_engine);
>>  }
>>+/*
>>+ * GuC stores busyness stats for each engine at context in/out boundaries. A
>>+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
>>+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
>>+ * GuC.
>>+ *
>>+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
>>+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
>>+ * active. For an active engine total busyness = total + (now - start), where
>>+ * 'now' is the time at which the busyness is sampled. For inactive engine,
>>+ * total busyness = total.
>>+ *
>>+ * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
>>+ *
>>+ * The start and total values provided by GuC are 32 bits and wrap around in a
>>+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
>>+ * increasing ns values, there is a need for this implementation to account for
>>+ * overflows and extend the GuC provided values to 64 bits before returning
>>+ * busyness to the user. In order to do that, a worker runs periodically at
>>+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
>>+ * 27 seconds for a gt clock frequency of 19.2 MHz).
>>+ */
>>+
>>+#define WRAP_TIME_CLKS U32_MAX
>>+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
>>+
>>+static inline void
>
>I'd probably drop the inline from here and the one below and let the 
>compiler decide.

Did this. I dropped one of the helpers in the new revision since it's 
called only from one place.

>
>>+__update_timestamp(struct intel_guc *guc, u64 *prev_start, u32 new_start)
>>+{
>>+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
>>+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
>>+
>>+	if (new_start == lower_32_bits(*prev_start))
>>+		return;
>>+
>>+	if (new_start < gt_stamp_last &&
>>+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
>>+		gt_stamp_hi++;
>>+
>>+	if (new_start > gt_stamp_last &&
>>+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS)
>>+		if (gt_stamp_hi)
>>+			gt_stamp_hi--;
>>+
>>+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
>>+}
>>+
>>+static inline void
>>+__update_counter(u64 *curr_value, u32 new)
>>+{
>>+	u32 hi = upper_32_bits(*curr_value);
>>+
>>+	if (new < lower_32_bits(*curr_value))
>>+		hi++;
>>+
>>+	*curr_value = ((u64)hi << 32) | new;
>>+}
>>+
>>+static bool guc_update_engine_gt_clks(struct intel_engine_cs *engine)
>>+{
>>+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
>>+	struct intel_guc *guc = &engine->gt->uc.guc;
>>+	u32 last_switch = rec->last_switch_in_stamp;
>>+	u32 ctx_id = rec->current_context_index;
>>+	u32 total = rec->total_runtime;
>>+	bool active = ctx_id != ~0U && last_switch;
>>+
>>+	if (active)
>>+		__update_timestamp(guc, &engine->stats.start_gt_clk,
>>+				   last_switch);
>>+
>>+	/*
>>+	 * Instead of adjusting the total for overflow, just add the
>>+	 * difference from previous sample to the stats.total_gt_clks
>>+	 */
>>+	if (total && total != ~0U) {
>>+		engine->stats.total_gt_clks += (u32)(total -
>>+						     engine->stats.prev_total);
>>+		engine->stats.prev_total = total;
>>+	}
>>+
>>+	return active;
>>+}
>>+
>>+static void guc_update_pm_timestamp(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	u32 gt_stamp_now;
>>+
>>+	if (intel_gt_pm_get_if_awake(gt)) {
>>+		gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
>>+		intel_gt_pm_put_async(gt);
>>+		__update_counter(&guc->timestamp.gt_stamp, gt_stamp_now);
>>+	}
>>+}
>>+
>>+/*
>>+ * Unlike the execlist mode of submission, total and active times are in terms of
>>+ * gt clocks. The *now parameter is retained to return the cpu time at which the
>>+ * busyness was sampled.
>>+ */
>>+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>+{
>>+	struct intel_gt *gt = engine->gt;
>>+	struct intel_guc *guc = &gt->uc.guc;
>>+	unsigned long flags;
>>+	bool active;
>>+	u64 total;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	*now = ktime_get();
>>+	active = guc_update_engine_gt_clks(engine);
>>+	guc_update_pm_timestamp(guc);
>
>I am a bit nervous that we have a mix of i915 view of "active" (pm get 
>if awake) and GuC via "active = <read shared page and determine>".
>
>The two sources come together in:
>
> if (active) {
>	u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>
>Where active comes from GuC and gt_stamp "up to dateness" comes from 
>whether or not intel_gt_pm_get_if_awake succeeded.
>
>Coupled with the fact that you will need to add some hooks to pmu 
>park/unpark to handle the ping worker I wonder how it would look to 
>try and use a more consistent view here.
>
>What I mean is use the i915 view exclusively when deciding whether or 
>not to query anything from the GuC, or just use last known i915 copy 
>of the data.
>
>In other words the outline of the operation would be:
>
>guc_engine_busyness / guc_timestamp_ping
>{
>	intel_gt_pm_get_if_awake {
>		spin_lock
>		read and update guc state
>		spin_unlock
>	} else {
>		spin_lock
>		read last known guc state (100% driver copy)
>		spin_unlock
>	}
>
>	...
>}
>
>pmu park
>{
>	park worker
>
>	spin_lock
>	read and update guc state
>	spin_unlock
>}
>
>pmu unpark
>{
>	unpark worker
>}	
>
>Not sure how much my concern amounts to in practice, so it is open for 
>discussion.
>
>At least we know we have inherent raciness in context save vs GuC 
>tracking, so perhaps best not to add more.

Fair enough. This is added in the new revision as proposed. The cached 
state is engine->stats.total_gt_clks, engine->stats.start_gt_clk (as 
before) and engine->stats.running (new).

Thanks,
Umesh

>
>Otherwise the design, in terms of how it fits into i915, now looks 
>fine to me (modulo unfinished worker parking).
>
>Regards,
>
>Tvrtko
>
>>+
>>+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
>>+	if (active) {
>>+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
>>+
>>+		total += intel_gt_clock_interval_to_ns(gt, clk);
>>+	}
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+
>>+	return ns_to_ktime(total);
>>+}
>>+
>>+static void guc_timestamp_ping(struct work_struct *wrk)
>>+{
>>+	struct intel_guc *guc = container_of(wrk, typeof(*guc), timestamp.work.work);
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	struct intel_engine_cs *engine;
>>+	intel_engine_mask_t tmp;
>>+	unsigned long flags;
>>+
>>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+	/* adjust the guc pm timestamp for overflow */
>>+	guc_update_pm_timestamp(guc);
>>+
>>+	/* adjust the engine stats for overflow */
>>+	for_each_engine_masked(engine, gt, ALL_ENGINES, tmp)
>
>for_each_engine
>
>>+		guc_update_engine_gt_clks(engine);
>>+
>>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>+
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
>>+}
>>+
>>+static int guc_action_enable_usage_stats(struct intel_guc *guc)
>>+{
>>+	u32 offset = intel_guc_engine_usage_offset(guc);
>>+	u32 action[] = {
>>+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
>>+		offset,
>>+		0,
>>+	};
>>+
>>+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
>>+}
>>+
>>+static void __queue_work(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+
>>+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
>>+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
>>+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
>>+}
>>+
>>+static void guc_init_engine_stats(struct intel_guc *guc)
>>+{
>>+	struct intel_gt *gt = guc_to_gt(guc);
>>+	intel_wakeref_t wakeref;
>>+
>>+	spin_lock_init(&guc->timestamp.lock);
>>+	__queue_work(guc);
>>+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
>>+		int ret = guc_action_enable_usage_stats(guc);
>>+
>>+		if (ret)
>>+			drm_err(&gt->i915->drm,
>>+				"Failed to enable usage stats: %d!\n", ret);
>>+	}
>>+}
>>+
>>  static inline void queue_request(struct i915_sched_engine *sched_engine,
>>  				 struct i915_request *rq,
>>  				 int prio)
>>@@ -2606,7 +2807,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>>  		engine->emit_flush = gen12_emit_flush_xcs;
>>  	}
>>  	engine->set_default_submission = guc_set_default_submission;
>>+	engine->busyness = guc_engine_busyness;
>>+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>>  	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>>  	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>>@@ -2705,6 +2908,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>>  void intel_guc_submission_enable(struct intel_guc *guc)
>>  {
>>  	guc_init_lrc_mapping(guc);
>>+	guc_init_engine_stats(guc);
>>  }
>>  void intel_guc_submission_disable(struct intel_guc *guc)
>>diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>index ef594df039db..8bc88c1bd68e 100644
>>--- a/drivers/gpu/drm/i915/i915_reg.h
>>+++ b/drivers/gpu/drm/i915/i915_reg.h
>>@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>>  #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>>  #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>>+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
>>+
>>  /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>>  #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>>  #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
  2021-09-24 22:34 Umesh Nerlige Ramappa
@ 2021-10-04 15:21 ` Tvrtko Ursulin
  2021-10-05 18:03   ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 24+ messages in thread
From: Tvrtko Ursulin @ 2021-10-04 15:21 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa, intel-gfx, dri-devel
  Cc: john.c.harrison, daniel.vetter


On 24/09/2021 23:34, Umesh Nerlige Ramappa wrote:
> With GuC handling scheduling, i915 is not aware of the time that a
> context is scheduled in and out of the engine. Since i915 pmu relies on
> this info to provide engine busyness to the user, GuC shares this info
> with i915 for all engines using shared memory. For each engine, this
> info contains:
> 
> - total busyness: total time that the context was running (total)
> - id: id of the running context (id)
> - start timestamp: timestamp when the context started running (start)
> 
> At the time (now) of sampling the engine busyness, if the id is valid
> (!= ~0), and start is non-zero, then the context is considered to be
> active and the engine busyness is calculated using the below equation
> 
> 	engine busyness = total + (now - start)
> 
> All times are obtained from the gt clock base. For inactive contexts,
> engine busyness is just equal to the total.
> 
> The start and total values provided by GuC are 32 bits and wrap around
> in a few minutes. Since perf pmu provides busyness as 64 bit
> monotonically increasing values, there is a need for this implementation
> to account for overflows and extend the time to 64 bits before returning
> busyness to the user. In order to do that, a worker runs periodically at
> frequqncy = 1/8th the time it takes for the timestamp to wrap. As an

frequency

> example, that would be once in 27 seconds for a gt clock frequency of
> 19.2 MHz.
> 
> Opens and wip that are targeted for later patches:
> 
> 1) On global gt reset the total busyness of engines resets and i915
>     needs to fix that so that user sees monotonically increasing
>     busyness.
> 2) In runtime suspend mode, the worker may not need to be run. We could
>     stop the worker on suspend and rerun it on resume provided that the
>     guc pm timestamp does not tick during suspend.

2) sounds easy since there are park/unpark hooks for pmu already. Will 
see if I can figure out why you did not just immediately do it.

I would also document in the commit message the known problem of 
possible over-accounting, just for historical reference.

> 
> v2: (Tvrtko)
> - Include details in commit message
> - Move intel engine busyness function into execlist code
> - Use union inside engine->stats
> - Use natural type for ping delay jiffies
> - Drop active_work condition checks
> - Use for_each_engine if iterating all engines
> - Drop seq locking, use spinlock at guc level to update engine stats
> - Document worker specific details
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +--
>   drivers/gpu/drm/i915/gt/intel_engine_types.h  |  82 ++++---
>   .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 +++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 204 ++++++++++++++++++
>   drivers/gpu/drm/i915/i915_reg.h               |   2 +
>   10 files changed, 363 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 2ae57e4656a3..6fcc70a313d9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	intel_engine_print_breadcrumbs(engine, m);
>   }
>   
> -static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
> -					    ktime_t *now)
> -{
> -	ktime_t total = engine->stats.total;
> -
> -	/*
> -	 * If the engine is executing something at the moment
> -	 * add it to the total.
> -	 */
> -	*now = ktime_get();
> -	if (READ_ONCE(engine->stats.active))
> -		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> -
> -	return total;
> -}
> -
>   /**
>    * intel_engine_get_busy_time() - Return current accumulated engine busyness
>    * @engine: engine to report on
> @@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
>    */
>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
>   {
> -	unsigned int seq;
> -	ktime_t total;
> -
> -	do {
> -		seq = read_seqcount_begin(&engine->stats.lock);
> -		total = __intel_engine_get_busy_time(engine, now);
> -	} while (read_seqcount_retry(&engine->stats.lock, seq));
> -
> -	return total;
> +	return engine->busyness(engine, now);
>   }
>   
>   struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 5ae1207c363b..490166b54ed6 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -432,6 +432,12 @@ struct intel_engine_cs {
>   	void		(*add_active_request)(struct i915_request *rq);
>   	void		(*remove_active_request)(struct i915_request *rq);
>   
> +	/*
> +	 * Get engine busyness and the time at which the busyness was sampled.
> +	 */
> +	ktime_t		(*busyness)(struct intel_engine_cs *engine,
> +				    ktime_t *now);
> +
>   	struct intel_engine_execlists execlists;
>   
>   	/*
> @@ -481,30 +487,58 @@ struct intel_engine_cs {
>   	u32 (*get_cmd_length_mask)(u32 cmd_header);
>   
>   	struct {
> -		/**
> -		 * @active: Number of contexts currently scheduled in.
> -		 */
> -		unsigned int active;
> -
> -		/**
> -		 * @lock: Lock protecting the below fields.
> -		 */
> -		seqcount_t lock;
> -
> -		/**
> -		 * @total: Total time this engine was busy.
> -		 *
> -		 * Accumulated time not counting the most recent block in cases
> -		 * where engine is currently busy (active > 0).
> -		 */
> -		ktime_t total;
> -
> -		/**
> -		 * @start: Timestamp of the last idle to active transition.
> -		 *
> -		 * Idle is defined as active == 0, active is active > 0.
> -		 */
> -		ktime_t start;
> +		union {

Maybe put a marker like:

			/* Fields used by the execlists backend. */

> +			struct {
> +				/**
> +				 * @active: Number of contexts currently
> +				 * scheduled in.
> +				 */
> +				unsigned int active;
> +
> +				/**
> +				 * @lock: Lock protecting the below fields.
> +				 */
> +				seqcount_t lock;
> +
> +				/**
> +				 * @total: Total time this engine was busy.
> +				 *
> +				 * Accumulated time not counting the most recent
> +				 * block in cases where engine is currently busy
> +				 * (active > 0).
> +				 */
> +				ktime_t total;
> +
> +				/**
> +				 * @start: Timestamp of the last idle to active
> +				 * transition.
> +				 *
> +				 * Idle is defined as active == 0, active is
> +				 * active > 0.
> +				 */
> +				ktime_t start;
> +			};
> +

			/* Fields used by the GuC backend. */

> +			struct {
> +				/**
> +				 * @prev_total: Previous value of total runtime
> +				 * clock cycles.
> +				 */
> +				u32 prev_total;
> +
> +				/**
> +				 * @total_gt_clks: Total gt clock cycles this
> +				 * engine was busy.
> +				 */
> +				u64 total_gt_clks;
> +
> +				/**
> +				 * @start_gt_clk: GT clock time of last idle to
> +				 * active transition.
> +				 */
> +				u64 start_gt_clk;
> +			};
> +		};
>   
>   		/**
>   		 * @rps: Utilisation at last RPS sampling.
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 7147fe80919e..5c9b695e906c 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
>   	lrc_fini_wa_ctx(engine);
>   }
>   
> +static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
> +					   ktime_t *now)
> +{
> +	ktime_t total = engine->stats.total;
> +
> +	/*
> +	 * If the engine is executing something at the moment
> +	 * add it to the total.
> +	 */
> +	*now = ktime_get();
> +	if (READ_ONCE(engine->stats.active))
> +		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
> +
> +	return total;
> +}
> +
> +static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
> +					 ktime_t *now)
> +{
> +	unsigned int seq;
> +	ktime_t total;
> +
> +	do {
> +		seq = read_seqcount_begin(&engine->stats.lock);
> +		total = __execlists_engine_busyness(engine, now);
> +	} while (read_seqcount_retry(&engine->stats.lock, seq));
> +
> +	return total;
> +}
> +
>   static void
>   logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   {
> @@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_bb_start = gen8_emit_bb_start;
>   	else
>   		engine->emit_bb_start = gen8_emit_bb_start_noarb;
> +
> +	engine->busyness = execlists_engine_busyness;
>   }
>   
>   static void logical_ring_default_irqs(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index 8ff582222aff..ff1311d4beff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -143,6 +143,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
>   	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
>   	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
> +	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
>   	INTEL_GUC_ACTION_LIMIT
>   };
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 5dd174babf7a..22c30dbdf63a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -104,6 +104,8 @@ struct intel_guc {
>   	u32 ads_regset_size;
>   	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
>   	u32 ads_golden_ctxt_size;
> +	/** @ads_engine_usage_size: size of engine usage in the ADS */
> +	u32 ads_engine_usage_size;
>   
>   	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
>   	struct i915_vma *lrc_desc_pool;
> @@ -138,6 +140,30 @@ struct intel_guc {
>   
>   	/** @send_mutex: used to serialize the intel_guc_send actions */
>   	struct mutex send_mutex;
> +
> +	struct {
> +		/**
> +		 * @lock: Lock protecting the below fields and the engine stats.
> +		 */
> +		spinlock_t lock;
> +
> +		/**
> +		 * @gt_stamp: 64 bit extended value of the GT timestamp.
> +		 */
> +		u64 gt_stamp;
> +
> +		/**
> +		 * @ping_delay: Period for polling the GT timestamp for
> +		 * overflow.
> +		 */
> +		unsigned long ping_delay;
> +
> +		/**
> +		 * @work: Periodic work to adjust GT timestamp, engine and
> +		 * context usage for overflows.
> +		 */
> +		struct delayed_work work;
> +	} timestamp;
>   };
>   
>   static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> index 2c6ea64af7ec..ca9ab53999d5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
> @@ -26,6 +26,8 @@
>    *      | guc_policies                          |
>    *      +---------------------------------------+
>    *      | guc_gt_system_info                    |
> + *      +---------------------------------------+
> + *      | guc_engine_usage                      |
>    *      +---------------------------------------+ <== static
>    *      | guc_mmio_reg[countA] (engine 0.0)     |
>    *      | guc_mmio_reg[countB] (engine 0.1)     |
> @@ -47,6 +49,7 @@ struct __guc_ads_blob {
>   	struct guc_ads ads;
>   	struct guc_policies policies;
>   	struct guc_gt_system_info system_info;
> +	struct guc_engine_usage engine_usage;
>   	/* From here on, location is dynamic! Refer to above diagram. */
>   	struct guc_mmio_reg regset[0];
>   } __packed;
> @@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
>   
>   	guc_ads_private_data_reset(guc);
>   }
> +
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
> +{
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
> +	u32 offset = base + ptr_offset(blob, engine_usage);
> +
> +	return offset;
> +}
> +
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
> +{
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	struct __guc_ads_blob *blob = guc->ads_blob;
> +	u8 guc_class = engine_class_to_guc_class(engine->class);
> +
> +	return &blob->engine_usage.engines[guc_class][engine->instance];
> +}
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> index 3d85051d57e4..e74c110facff 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
> @@ -6,8 +6,11 @@
>   #ifndef _INTEL_GUC_ADS_H_
>   #define _INTEL_GUC_ADS_H_
>   
> +#include <linux/types.h>
> +
>   struct intel_guc;
>   struct drm_printer;
> +struct intel_engine_cs;
>   
>   int intel_guc_ads_create(struct intel_guc *guc);
>   void intel_guc_ads_destroy(struct intel_guc *guc);
> @@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
>   void intel_guc_ads_reset(struct intel_guc *guc);
>   void intel_guc_ads_print_policy_info(struct intel_guc *guc,
>   				     struct drm_printer *p);
> +struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
> +u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
>   
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index fa4be13c8854..7c9c081670fc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -294,6 +294,19 @@ struct guc_ads {
>   	u32 reserved[15];
>   } __packed;
>   
> +/* Engine usage stats */
> +struct guc_engine_usage_record {
> +	u32 current_context_index;
> +	u32 last_switch_in_stamp;
> +	u32 reserved0;
> +	u32 total_runtime;
> +	u32 reserved1[4];
> +} __packed;
> +
> +struct guc_engine_usage {
> +	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
> +} __packed;
> +
>   /* GuC logging structures */
>   
>   enum guc_log_buffer_type {
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index ba0de35f6323..5d29a4913e17 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -12,6 +12,7 @@
>   #include "gt/intel_engine_pm.h"
>   #include "gt/intel_engine_heartbeat.h"
>   #include "gt/intel_gt.h"
> +#include "gt/intel_gt_clock_utils.h"
>   #include "gt/intel_gt_irq.h"
>   #include "gt/intel_gt_pm.h"
>   #include "gt/intel_gt_requests.h"
> @@ -20,6 +21,7 @@
>   #include "gt/intel_mocs.h"
>   #include "gt/intel_ring.h"
>   
> +#include "intel_guc_ads.h"
>   #include "intel_guc_submission.h"
>   
>   #include "i915_drv.h"
> @@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
>   static void disable_submission(struct intel_guc *guc)
>   {
>   	struct i915_sched_engine * const sched_engine = guc->sched_engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
>   
>   	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>   		GEM_BUG_ON(!guc->ct.enabled);
>   		__tasklet_disable_sync_once(&sched_engine->tasklet);
>   		sched_engine->tasklet.callback = NULL;
>   	}
> +
> +	cancel_delayed_work(&guc->timestamp.work);
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	for_each_engine(engine, gt, id)
> +		engine->stats.prev_total = 0;
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>   }
>   
>   static void enable_submission(struct intel_guc *guc)
> @@ -1164,6 +1179,192 @@ void intel_guc_submission_fini(struct intel_guc *guc)
>   	i915_sched_engine_put(guc->sched_engine);
>   }
>   
> +/*
> + * GuC stores busyness stats for each engine at context in/out boundaries. A
> + * context 'in' logs execution start time, 'out' adds in -> out delta to total.
> + * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
> + * GuC.
> + *
> + * __i915_pmu_event_read samples engine busyness. When sampling, if context id
> + * is valid (!= ~0) and start is non-zero, the engine is considered to be
> + * active. For an active engine total busyness = total + (now - start), where
> + * 'now' is the time at which the busyness is sampled. For inactive engine,
> + * total busyness = total.
> + *
> + * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
> + *
> + * The start and total values provided by GuC are 32 bits and wrap around in a
> + * few minutes. Since perf pmu provides busyness as 64 bit monotonically
> + * increasing ns values, there is a need for this implementation to account for
> + * overflows and extend the GuC provided values to 64 bits before returning
> + * busyness to the user. In order to do that, a worker runs periodically at
> + * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
> + * 27 seconds for a gt clock frequency of 19.2 MHz).
> + */
> +
> +#define WRAP_TIME_CLKS U32_MAX
> +#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
> +
> +static inline void

I'd probably drop the inline from here and the one below and let the 
compiler decide.

> +__update_timestamp(struct intel_guc *guc, u64 *prev_start, u32 new_start)
> +{
> +	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
> +	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
> +
> +	if (new_start == lower_32_bits(*prev_start))
> +		return;
> +
> +	if (new_start < gt_stamp_last &&
> +	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
> +		gt_stamp_hi++;
> +
> +	if (new_start > gt_stamp_last &&
> +	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS)
> +		if (gt_stamp_hi)
> +			gt_stamp_hi--;
> +
> +	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
> +}
> +
> +static inline void
> +__update_counter(u64 *curr_value, u32 new)
> +{
> +	u32 hi = upper_32_bits(*curr_value);
> +
> +	if (new < lower_32_bits(*curr_value))
> +		hi++;
> +
> +	*curr_value = ((u64)hi << 32) | new;
> +}
> +
> +static bool guc_update_engine_gt_clks(struct intel_engine_cs *engine)
> +{
> +	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
> +	struct intel_guc *guc = &engine->gt->uc.guc;
> +	u32 last_switch = rec->last_switch_in_stamp;
> +	u32 ctx_id = rec->current_context_index;
> +	u32 total = rec->total_runtime;
> +	bool active = ctx_id != ~0U && last_switch;
> +
> +	if (active)
> +		__update_timestamp(guc, &engine->stats.start_gt_clk,
> +				   last_switch);
> +
> +	/*
> +	 * Instead of adjusting the total for overflow, just add the
> +	 * difference from previous sample to the stats.total_gt_clks
> +	 */
> +	if (total && total != ~0U) {
> +		engine->stats.total_gt_clks += (u32)(total -
> +						     engine->stats.prev_total);
> +		engine->stats.prev_total = total;
> +	}
> +
> +	return active;
> +}
> +
> +static void guc_update_pm_timestamp(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	u32 gt_stamp_now;
> +
> +	if (intel_gt_pm_get_if_awake(gt)) {
> +		gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
> +		intel_gt_pm_put_async(gt);
> +		__update_counter(&guc->timestamp.gt_stamp, gt_stamp_now);
> +	}
> +}
> +
> +/*
> + * Unlike the execlist mode of submission total and active times are in terms of
> + * gt clocks. The *now parameter is retained to return the cpu time at which the
> + * busyness was sampled.
> + */
> +static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
> +{
> +	struct intel_gt *gt = engine->gt;
> +	struct intel_guc *guc = &gt->uc.guc;
> +	unsigned long flags;
> +	bool active;
> +	u64 total;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	*now = ktime_get();
> +	active = guc_update_engine_gt_clks(engine);
> +	guc_update_pm_timestamp(guc);

I am a bit nervous that we have a mix of i915 view of "active" (pm get 
if awake) and GuC via "active = <read shared page and determine>".

The two sources come together in:

  if (active) {
	u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;

Where active comes from GuC and gt_stamp "up to dateness" comes from 
whether or not intel_gt_pm_get_if_awake succeeded.

Coupled with the fact that you will need to add some hooks to pmu 
park/unpark to handle the ping worker I wonder how it would look to try 
and use a more consistent view here.

What I mean is use the i915 view exclusively when deciding whether or 
not to query anything from the GuC, or just use last known i915 copy of 
the data.

In other words the outline of the operation would be:

guc_engine_busyness / guc_timestamp_ping
{
	intel_gt_pm_get_if_awake {
		spin_lock
		read and update guc state
		spin_unlock
	} else {
		spin_lock
		read last known guc state (100% driver copy)
		spin_unlock
	}

	...
}

pmu park
{
	park worker

	spin_lock
	read and update guc state
	spin_unlock
}

pmu unpark
{
	unpark worker
}	

Not sure how much my concern amounts to in practice, so it is open for 
discussion.

At least we know we have inherent raciness in context save vs GuC 
tracking, so perhaps best not to add more.

Otherwise the design, in terms of how it fits into i915, now looks fine 
to me (modulo unfinished worker parking).

Regards,

Tvrtko

> +
> +	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
> +	if (active) {
> +		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
> +
> +		total += intel_gt_clock_interval_to_ns(gt, clk);
> +	}
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	return ns_to_ktime(total);
> +}
> +
> +static void guc_timestamp_ping(struct work_struct *wrk)
> +{
> +	struct intel_guc *guc = container_of(wrk, typeof(*guc), timestamp.work.work);
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	intel_engine_mask_t tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +
> +	/* adjust the guc pm timestamp for overflow */
> +	guc_update_pm_timestamp(guc);
> +
> +	/* adjust the engine stats for overflow */
> +	for_each_engine_masked(engine, gt, ALL_ENGINES, tmp)

for_each_engine

> +		guc_update_engine_gt_clks(engine);
> +
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
> +}
> +
> +static int guc_action_enable_usage_stats(struct intel_guc *guc)
> +{
> +	u32 offset = intel_guc_engine_usage_offset(guc);
> +	u32 action[] = {
> +		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
> +		offset,
> +		0,
> +	};
> +
> +	return intel_guc_send(guc, action, ARRAY_SIZE(action));
> +}
> +
> +static void __queue_work(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +
> +	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
> +	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
> +	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
> +}
> +
> +static void guc_init_engine_stats(struct intel_guc *guc)
> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	intel_wakeref_t wakeref;
> +
> +	spin_lock_init(&guc->timestamp.lock);
> +	__queue_work(guc);
> +	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
> +		int ret = guc_action_enable_usage_stats(guc);
> +
> +		if (ret)
> +			drm_err(&gt->i915->drm,
> +				"Failed to enable usage stats: %d!\n", ret);
> +	}
> +}
> +
>   static inline void queue_request(struct i915_sched_engine *sched_engine,
>   				 struct i915_request *rq,
>   				 int prio)
> @@ -2606,7 +2807,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
>   		engine->emit_flush = gen12_emit_flush_xcs;
>   	}
>   	engine->set_default_submission = guc_set_default_submission;
> +	engine->busyness = guc_engine_busyness;
>   
> +	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
>   	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
>   	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
>   
> @@ -2705,6 +2908,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
>   void intel_guc_submission_enable(struct intel_guc *guc)
>   {
>   	guc_init_lrc_mapping(guc);
> +	guc_init_engine_stats(guc);
>   }
>   
>   void intel_guc_submission_disable(struct intel_guc *guc)
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index ef594df039db..8bc88c1bd68e 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>   #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
>   #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
>   
> +#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
> +
>   /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
>   #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
>   #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
> 


* [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
@ 2021-09-24 22:34 Umesh Nerlige Ramappa
  2021-10-04 15:21 ` Tvrtko Ursulin
  0 siblings, 1 reply; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-09-24 22:34 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: john.c.harrison, Tvrtko Ursulin, daniel.vetter

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time of sampling the engine busyness (now), if the id is valid
(!= ~0) and start is non-zero, then the context is considered active and
the engine busyness is calculated using the equation below

	engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.
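
For illustration, that sampling rule can be sketched in a few lines of C
(a simplified userspace model with illustrative names, not the driver code;
the real values live in memory shared with GuC, and all quantities here are
gt clock ticks):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the sampling rule above. The 32-bit subtraction
 * (now - start) naturally handles a single wrap of the gt timestamp.
 */
static uint64_t engine_busyness(uint32_t id, uint32_t start,
				uint64_t total, uint32_t now)
{
	/* Active only if the context id is valid (!= ~0) and start != 0. */
	if (id != ~0u && start)
		return total + (uint32_t)(now - start);

	return total;	/* inactive: busyness is just the accumulated total */
}
```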

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.
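
As a sanity check on those numbers, here is a standalone sketch using the
patch's constants (`ping_delay_secs` is an illustrative helper, not a driver
function; `UINT32_MAX` stands in for the kernel's `U32_MAX`, and the `+ 1`
mirrors the rounding in `__queue_work`, turning the raw ~27.96 s period at
19.2 MHz into whole seconds):

```c
#include <stdint.h>

#define WRAP_TIME_CLKS UINT32_MAX		/* 32-bit timestamp wraps every 2^32 ticks */
#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)	/* poll at 1/8th of the wrap time */

/* Worker period in whole seconds for a given gt clock frequency in Hz. */
static unsigned long ping_delay_secs(uint32_t gt_clock_hz)
{
	return POLL_TIME_CLKS / gt_clock_hz + 1;
}
```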

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  26 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  82 ++++---
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  26 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h    |   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 204 ++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h               |   2 +
 10 files changed, 363 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-					    ktime_t *now)
-{
-	ktime_t total = engine->stats.total;
-
-	/*
-	 * If the engine is executing something at the moment
-	 * add it to the total.
-	 */
-	*now = ktime_get();
-	if (READ_ONCE(engine->stats.active))
-		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-	return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 {
-	unsigned int seq;
-	ktime_t total;
-
-	do {
-		seq = read_seqcount_begin(&engine->stats.lock);
-		total = __intel_engine_get_busy_time(engine, now);
-	} while (read_seqcount_retry(&engine->stats.lock, seq));
-
-	return total;
+	return engine->busyness(engine, now);
 }
 
 struct intel_context *
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 5ae1207c363b..490166b54ed6 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -432,6 +432,12 @@ struct intel_engine_cs {
 	void		(*add_active_request)(struct i915_request *rq);
 	void		(*remove_active_request)(struct i915_request *rq);
 
+	/*
+	 * Get engine busyness and the time at which the busyness was sampled.
+	 */
+	ktime_t		(*busyness)(struct intel_engine_cs *engine,
+				    ktime_t *now);
+
 	struct intel_engine_execlists execlists;
 
 	/*
@@ -481,30 +487,58 @@ struct intel_engine_cs {
 	u32 (*get_cmd_length_mask)(u32 cmd_header);
 
 	struct {
-		/**
-		 * @active: Number of contexts currently scheduled in.
-		 */
-		unsigned int active;
-
-		/**
-		 * @lock: Lock protecting the below fields.
-		 */
-		seqcount_t lock;
-
-		/**
-		 * @total: Total time this engine was busy.
-		 *
-		 * Accumulated time not counting the most recent block in cases
-		 * where engine is currently busy (active > 0).
-		 */
-		ktime_t total;
-
-		/**
-		 * @start: Timestamp of the last idle to active transition.
-		 *
-		 * Idle is defined as active == 0, active is active > 0.
-		 */
-		ktime_t start;
+		union {
+			struct {
+				/**
+				 * @active: Number of contexts currently
+				 * scheduled in.
+				 */
+				unsigned int active;
+
+				/**
+				 * @lock: Lock protecting the below fields.
+				 */
+				seqcount_t lock;
+
+				/**
+				 * @total: Total time this engine was busy.
+				 *
+				 * Accumulated time not counting the most recent
+				 * block in cases where engine is currently busy
+				 * (active > 0).
+				 */
+				ktime_t total;
+
+				/**
+				 * @start: Timestamp of the last idle to active
+				 * transition.
+				 *
+				 * Idle is defined as active == 0, active is
+				 * active > 0.
+				 */
+				ktime_t start;
+			};
+
+			struct {
+				/**
+				 * @prev_total: Previous value of total runtime
+				 * clock cycles.
+				 */
+				u32 prev_total;
+
+				/**
+				 * @total_gt_clks: Total gt clock cycles this
+				 * engine was busy.
+				 */
+				u64 total_gt_clks;
+
+				/**
+				 * @start_gt_clk: GT clock time of last idle to
+				 * active transition.
+				 */
+				u64 start_gt_clk;
+			};
+		};
 
 		/**
 		 * @rps: Utilisation at last RPS sampling.
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 7147fe80919e..5c9b695e906c 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3292,6 +3292,36 @@ static void execlists_release(struct intel_engine_cs *engine)
 	lrc_fini_wa_ctx(engine);
 }
 
+static ktime_t __execlists_engine_busyness(struct intel_engine_cs *engine,
+					   ktime_t *now)
+{
+	ktime_t total = engine->stats.total;
+
+	/*
+	 * If the engine is executing something at the moment
+	 * add it to the total.
+	 */
+	*now = ktime_get();
+	if (READ_ONCE(engine->stats.active))
+		total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+
+	return total;
+}
+
+static ktime_t execlists_engine_busyness(struct intel_engine_cs *engine,
+					 ktime_t *now)
+{
+	unsigned int seq;
+	ktime_t total;
+
+	do {
+		seq = read_seqcount_begin(&engine->stats.lock);
+		total = __execlists_engine_busyness(engine, now);
+	} while (read_seqcount_retry(&engine->stats.lock, seq));
+
+	return total;
+}
+
 static void
 logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 {
@@ -3348,6 +3378,8 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_bb_start = gen8_emit_bb_start;
 	else
 		engine->emit_bb_start = gen8_emit_bb_start_noarb;
+
+	engine->busyness = execlists_engine_busyness;
 }
 
 static void logical_ring_default_irqs(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 8ff582222aff..ff1311d4beff 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -143,6 +143,7 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_DEREGISTER_COMMAND_TRANSPORT_BUFFER = 0x4506,
 	INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
 	INTEL_GUC_ACTION_RESET_CLIENT = 0x5507,
+	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
 	INTEL_GUC_ACTION_LIMIT
 };
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 5dd174babf7a..22c30dbdf63a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -104,6 +104,8 @@ struct intel_guc {
 	u32 ads_regset_size;
 	/** @ads_golden_ctxt_size: size of the golden contexts in the ADS */
 	u32 ads_golden_ctxt_size;
+	/** @ads_engine_usage_size: size of engine usage in the ADS */
+	u32 ads_engine_usage_size;
 
 	/** @lrc_desc_pool: object allocated to hold the GuC LRC descriptor pool */
 	struct i915_vma *lrc_desc_pool;
@@ -138,6 +140,30 @@ struct intel_guc {
 
 	/** @send_mutex: used to serialize the intel_guc_send actions */
 	struct mutex send_mutex;
+
+	struct {
+		/**
+		 * @lock: Lock protecting the below fields and the engine stats.
+		 */
+		spinlock_t lock;
+
+		/**
+		 * @gt_stamp: 64 bit extended value of the GT timestamp.
+		 */
+		u64 gt_stamp;
+
+		/**
+		 * @ping_delay: Period for polling the GT timestamp for
+		 * overflow.
+		 */
+		unsigned long ping_delay;
+
+		/**
+		 * @work: Periodic work to adjust GT timestamp, engine and
+		 * context usage for overflows.
+		 */
+		struct delayed_work work;
+	} timestamp;
 };
 
 static inline struct intel_guc *log_to_guc(struct intel_guc_log *log)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 2c6ea64af7ec..ca9ab53999d5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -26,6 +26,8 @@
  *      | guc_policies                          |
  *      +---------------------------------------+
  *      | guc_gt_system_info                    |
+ *      +---------------------------------------+
+ *      | guc_engine_usage                      |
  *      +---------------------------------------+ <== static
  *      | guc_mmio_reg[countA] (engine 0.0)     |
  *      | guc_mmio_reg[countB] (engine 0.1)     |
@@ -47,6 +49,7 @@ struct __guc_ads_blob {
 	struct guc_ads ads;
 	struct guc_policies policies;
 	struct guc_gt_system_info system_info;
+	struct guc_engine_usage engine_usage;
 	/* From here on, location is dynamic! Refer to above diagram. */
 	struct guc_mmio_reg regset[0];
 } __packed;
@@ -628,3 +631,21 @@ void intel_guc_ads_reset(struct intel_guc *guc)
 
 	guc_ads_private_data_reset(guc);
 }
+
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
+{
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u32 base = intel_guc_ggtt_offset(guc, guc->ads_vma);
+	u32 offset = base + ptr_offset(blob, engine_usage);
+
+	return offset;
+}
+
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine)
+{
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	struct __guc_ads_blob *blob = guc->ads_blob;
+	u8 guc_class = engine_class_to_guc_class(engine->class);
+
+	return &blob->engine_usage.engines[guc_class][engine->instance];
+}
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
index 3d85051d57e4..e74c110facff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h
@@ -6,8 +6,11 @@
 #ifndef _INTEL_GUC_ADS_H_
 #define _INTEL_GUC_ADS_H_
 
+#include <linux/types.h>
+
 struct intel_guc;
 struct drm_printer;
+struct intel_engine_cs;
 
 int intel_guc_ads_create(struct intel_guc *guc);
 void intel_guc_ads_destroy(struct intel_guc *guc);
@@ -15,5 +18,7 @@ void intel_guc_ads_init_late(struct intel_guc *guc);
 void intel_guc_ads_reset(struct intel_guc *guc);
 void intel_guc_ads_print_policy_info(struct intel_guc *guc,
 				     struct drm_printer *p);
+struct guc_engine_usage_record *intel_guc_engine_usage(struct intel_engine_cs *engine);
+u32 intel_guc_engine_usage_offset(struct intel_guc *guc);
 
 #endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index fa4be13c8854..7c9c081670fc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -294,6 +294,19 @@ struct guc_ads {
 	u32 reserved[15];
 } __packed;
 
+/* Engine usage stats */
+struct guc_engine_usage_record {
+	u32 current_context_index;
+	u32 last_switch_in_stamp;
+	u32 reserved0;
+	u32 total_runtime;
+	u32 reserved1[4];
+} __packed;
+
+struct guc_engine_usage {
+	struct guc_engine_usage_record engines[GUC_MAX_ENGINE_CLASSES][GUC_MAX_INSTANCES_PER_CLASS];
+} __packed;
+
 /* GuC logging structures */
 
 enum guc_log_buffer_type {
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ba0de35f6323..5d29a4913e17 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -12,6 +12,7 @@
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gt.h"
+#include "gt/intel_gt_clock_utils.h"
 #include "gt/intel_gt_irq.h"
 #include "gt/intel_gt_pm.h"
 #include "gt/intel_gt_requests.h"
@@ -20,6 +21,7 @@
 #include "gt/intel_mocs.h"
 #include "gt/intel_ring.h"
 
+#include "intel_guc_ads.h"
 #include "intel_guc_submission.h"
 
 #include "i915_drv.h"
@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
 static void disable_submission(struct intel_guc *guc)
 {
 	struct i915_sched_engine * const sched_engine = guc->sched_engine;
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
 
 	if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 		GEM_BUG_ON(!guc->ct.enabled);
 		__tasklet_disable_sync_once(&sched_engine->tasklet);
 		sched_engine->tasklet.callback = NULL;
 	}
+
+	cancel_delayed_work(&guc->timestamp.work);
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	for_each_engine(engine, gt, id)
+		engine->stats.prev_total = 0;
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
 static void enable_submission(struct intel_guc *guc)
@@ -1164,6 +1179,192 @@ void intel_guc_submission_fini(struct intel_guc *guc)
 	i915_sched_engine_put(guc->sched_engine);
 }
 
+/*
+ * GuC stores busyness stats for each engine at context in/out boundaries. A
+ * context 'in' logs execution start time, 'out' adds in -> out delta to total.
+ * i915/kmd accesses 'start', 'total' and 'context id' from memory shared with
+ * GuC.
+ *
+ * __i915_pmu_event_read samples engine busyness. When sampling, if context id
+ * is valid (!= ~0) and start is non-zero, the engine is considered to be
+ * active. For an active engine total busyness = total + (now - start), where
+ * 'now' is the time at which the busyness is sampled. For inactive engine,
+ * total busyness = total.
+ *
+ * All times are captured from GUCPMTIMESTAMP reg and are in gt clock domain.
+ *
+ * The start and total values provided by GuC are 32 bits and wrap around in a
+ * few minutes. Since perf pmu provides busyness as 64 bit monotonically
+ * increasing ns values, there is a need for this implementation to account for
+ * overflows and extend the GuC provided values to 64 bits before returning
+ * busyness to the user. In order to do that, a worker runs periodically at
+ * frequency = 1/8th the time it takes for the timestamp to wrap (i.e. once in
+ * 27 seconds for a gt clock frequency of 19.2 MHz).
+ */
+
+#define WRAP_TIME_CLKS U32_MAX
+#define POLL_TIME_CLKS (WRAP_TIME_CLKS >> 3)
+
+static inline void
+__update_timestamp(struct intel_guc *guc, u64 *prev_start, u32 new_start)
+{
+	u32 gt_stamp_hi = upper_32_bits(guc->timestamp.gt_stamp);
+	u32 gt_stamp_last = lower_32_bits(guc->timestamp.gt_stamp);
+
+	if (new_start == lower_32_bits(*prev_start))
+		return;
+
+	if (new_start < gt_stamp_last &&
+	    (new_start - gt_stamp_last) <= POLL_TIME_CLKS)
+		gt_stamp_hi++;
+
+	if (new_start > gt_stamp_last &&
+	    (gt_stamp_last - new_start) <= POLL_TIME_CLKS)
+		if (gt_stamp_hi)
+			gt_stamp_hi--;
+
+	*prev_start = ((u64)gt_stamp_hi << 32) | new_start;
+}
+
+static inline void
+__update_counter(u64 *curr_value, u32 new)
+{
+	u32 hi = upper_32_bits(*curr_value);
+
+	if (new < lower_32_bits(*curr_value))
+		hi++;
+
+	*curr_value = ((u64)hi << 32) | new;
+}
+
+static bool guc_update_engine_gt_clks(struct intel_engine_cs *engine)
+{
+	struct guc_engine_usage_record *rec = intel_guc_engine_usage(engine);
+	struct intel_guc *guc = &engine->gt->uc.guc;
+	u32 last_switch = rec->last_switch_in_stamp;
+	u32 ctx_id = rec->current_context_index;
+	u32 total = rec->total_runtime;
+	bool active = ctx_id != ~0U && last_switch;
+
+	if (active)
+		__update_timestamp(guc, &engine->stats.start_gt_clk,
+				   last_switch);
+
+	/*
+	 * Instead of adjusting the total for overflow, just add the
+	 * difference from previous sample to the stats.total_gt_clks
+	 */
+	if (total && total != ~0U) {
+		engine->stats.total_gt_clks += (u32)(total -
+						     engine->stats.prev_total);
+		engine->stats.prev_total = total;
+	}
+
+	return active;
+}
+
+static void guc_update_pm_timestamp(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	u32 gt_stamp_now;
+
+	if (intel_gt_pm_get_if_awake(gt)) {
+		gt_stamp_now = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+		intel_gt_pm_put_async(gt);
+		__update_counter(&guc->timestamp.gt_stamp, gt_stamp_now);
+	}
+}
+
+/*
+ * Unlike the execlist mode of submission, total and active times are in terms
+ * of gt clocks. The *now parameter is retained to return the cpu time at which
+ * the busyness was sampled.
+ */
+static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
+{
+	struct intel_gt *gt = engine->gt;
+	struct intel_guc *guc = &gt->uc.guc;
+	unsigned long flags;
+	bool active;
+	u64 total;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	*now = ktime_get();
+	active = guc_update_engine_gt_clks(engine);
+	guc_update_pm_timestamp(guc);
+
+	total = intel_gt_clock_interval_to_ns(gt, engine->stats.total_gt_clks);
+	if (active) {
+		u64 clk = guc->timestamp.gt_stamp - engine->stats.start_gt_clk;
+
+		total += intel_gt_clock_interval_to_ns(gt, clk);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+
+	return ns_to_ktime(total);
+}
+
+static void guc_timestamp_ping(struct work_struct *wrk)
+{
+	struct intel_guc *guc = container_of(wrk, typeof(*guc), timestamp.work.work);
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	intel_engine_mask_t tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	/* adjust the guc pm timestamp for overflow */
+	guc_update_pm_timestamp(guc);
+
+	/* adjust the engine stats for overflow */
+	for_each_engine_masked(engine, gt, ALL_ENGINES, tmp)
+		guc_update_engine_gt_clks(engine);
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
+}
+
+static int guc_action_enable_usage_stats(struct intel_guc *guc)
+{
+	u32 offset = intel_guc_engine_usage_offset(guc);
+	u32 action[] = {
+		INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF,
+		offset,
+		0,
+	};
+
+	return intel_guc_send(guc, action, ARRAY_SIZE(action));
+}
+
+static void __queue_work(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+
+	guc->timestamp.ping_delay = (POLL_TIME_CLKS / gt->clock_frequency + 1) * HZ;
+	INIT_DELAYED_WORK(&guc->timestamp.work, guc_timestamp_ping);
+	mod_delayed_work(system_highpri_wq, &guc->timestamp.work, guc->timestamp.ping_delay);
+}
+
+static void guc_init_engine_stats(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	intel_wakeref_t wakeref;
+
+	spin_lock_init(&guc->timestamp.lock);
+	__queue_work(guc);
+	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref) {
+		int ret = guc_action_enable_usage_stats(guc);
+
+		if (ret)
+			drm_err(&gt->i915->drm,
+				"Failed to enable usage stats: %d!\n", ret);
+	}
+}
+
 static inline void queue_request(struct i915_sched_engine *sched_engine,
 				 struct i915_request *rq,
 				 int prio)
@@ -2606,7 +2807,9 @@ static void guc_default_vfuncs(struct intel_engine_cs *engine)
 		engine->emit_flush = gen12_emit_flush_xcs;
 	}
 	engine->set_default_submission = guc_set_default_submission;
+	engine->busyness = guc_engine_busyness;
 
+	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
 	engine->flags |= I915_ENGINE_HAS_PREEMPTION;
 	engine->flags |= I915_ENGINE_HAS_TIMESLICES;
 
@@ -2705,6 +2908,7 @@ int intel_guc_submission_setup(struct intel_engine_cs *engine)
 void intel_guc_submission_enable(struct intel_guc *guc)
 {
 	guc_init_lrc_mapping(guc);
+	guc_init_engine_stats(guc);
 }
 
 void intel_guc_submission_disable(struct intel_guc *guc)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index ef594df039db..8bc88c1bd68e 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2664,6 +2664,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define   RING_WAIT		(1 << 11) /* gen3+, PRBx_CTL */
 #define   RING_WAIT_SEMAPHORE	(1 << 10) /* gen6+ */
 
+#define GUCPMTIMESTAMP          _MMIO(0xC3E8)
+
 /* There are 16 64-bit CS General Purpose Registers per-engine on Gen8+ */
 #define GEN8_RING_CS_GPR(base, n)	_MMIO((base) + 0x600 + (n) * 8)
 #define GEN8_RING_CS_GPR_UDW(base, n)	_MMIO((base) + 0x600 + (n) * 8 + 4)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-10-07 23:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05 17:47 [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu Umesh Nerlige Ramappa
2021-10-05 17:47 ` [Intel-gfx] " Umesh Nerlige Ramappa
2021-10-05 22:14 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915/pmu: Connect engine busyness stats from GuC to pmu (rev2) Patchwork
2021-10-05 22:20 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-10-05 22:49 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2021-10-05 23:14 ` [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu Matthew Brost
2021-10-05 23:14   ` [Intel-gfx] " Matthew Brost
2021-10-06  8:22   ` Tvrtko Ursulin
2021-10-06  8:22     ` [Intel-gfx] " Tvrtko Ursulin
2021-10-06 17:04     ` Matthew Brost
2021-10-06 17:04       ` [Intel-gfx] " Matthew Brost
2021-10-07 23:00   ` Umesh Nerlige Ramappa
2021-10-07 23:00     ` [Intel-gfx] " Umesh Nerlige Ramappa
2021-10-06  9:11 ` Tvrtko Ursulin
2021-10-06  9:11   ` [Intel-gfx] " Tvrtko Ursulin
2021-10-06 20:45   ` Umesh Nerlige Ramappa
2021-10-06 20:45     ` [Intel-gfx] " Umesh Nerlige Ramappa
2021-10-07  8:17     ` Tvrtko Ursulin
2021-10-07  8:17       ` [Intel-gfx] " Tvrtko Ursulin
2021-10-07 15:42       ` Umesh Nerlige Ramappa
2021-10-07 15:42         ` [Intel-gfx] " Umesh Nerlige Ramappa
  -- strict thread matches above, loose matches on Subject: below --
2021-09-24 22:34 Umesh Nerlige Ramappa
2021-10-04 15:21 ` Tvrtko Ursulin
2021-10-05 18:03   ` Umesh Nerlige Ramappa
