* [Intel-gfx] [PATCH 1/2] i915/pmu: Wire GuC backend to per-client busyness
@ 2022-08-31 19:33 Ashutosh Dixit
  2022-08-31 19:33 ` [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking Ashutosh Dixit
  2022-08-31 19:50 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/2] i915/pmu: Wire GuC backend to per-client busyness Patchwork
  0 siblings, 2 replies; 10+ messages in thread
From: Ashutosh Dixit @ 2022-08-31 19:33 UTC (permalink / raw)
  To: intel-gfx

From: John Harrison <John.C.Harrison@Intel.com>

GuC provides engine_id and last_switch_in ticks for an active context in
the pphwsp. The context image provides a 32 bit total ticks value, which is
the time accumulated by the context (a.k.a. context[CTX_TIMESTAMP]). This
information is used to calculate context busyness as follows:

If the engine_id is valid, then busyness is the sum of accumulated total
ticks and active ticks. Active ticks are calculated using the current GT
time as reference.

If engine_id is invalid, busyness is equal to accumulated total ticks.

Since the KMD (CPU) retrieves busyness data from two sources - the GPU and
the GuC - an earlier review highlighted a potential race that can lead to
double accounting of busyness. While the solution to this is a work in
progress, busyness is still usable for platforms running GuC submission.

Remaining work: Enable and test context busyness for
virtual_parent_context_ops and virtual_child_context_ops.

v2: (Tvrtko)
- Use COPS_RUNTIME_ACTIVE_TOTAL
- Add code comment for the race
- Undo local variables initializations

v3:
- Add support for virtual engines based on
  https://patchwork.freedesktop.org/series/105227/

v4:
- Update commit message with remaining work.
- Rebase

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       | 12 +++-
 drivers/gpu/drm/i915/gt/intel_context.h       |  6 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  6 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  5 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 65 ++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drm_client.c        |  6 +-
 6 files changed, 89 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 654a092ed3d6..e2d70a9fdac0 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -576,16 +576,24 @@ void intel_context_bind_parent_child(struct intel_context *parent,
 	child->parallel.parent = parent;
 }
 
-u64 intel_context_get_total_runtime_ns(const struct intel_context *ce)
+u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
 {
 	u64 total, active;
 
+	if (ce->ops->update_stats)
+		ce->ops->update_stats(ce);
+
 	total = ce->stats.runtime.total;
 	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
 		total *= ce->engine->gt->clock_period_ns;
 
 	active = READ_ONCE(ce->stats.active);
-	if (active)
+	/*
+	 * When COPS_RUNTIME_ACTIVE_TOTAL is set for ce->cops, the backend
+	 * already provides the total active time of the context, so skip this
+	 * calculation when this flag is set.
+	 */
+	if (active && !(ce->ops->flags & COPS_RUNTIME_ACTIVE_TOTAL))
 		active = intel_context_clock() - active;
 
 	return total + active;
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 8e2d70630c49..3d1d7436c1a4 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -58,7 +58,7 @@ static inline bool intel_context_is_parent(struct intel_context *ce)
 	return !!ce->parallel.number_children;
 }
 
-static inline bool intel_context_is_pinned(struct intel_context *ce);
+static inline bool intel_context_is_pinned(const struct intel_context *ce);
 
 static inline struct intel_context *
 intel_context_to_parent(struct intel_context *ce)
@@ -118,7 +118,7 @@ static inline int intel_context_lock_pinned(struct intel_context *ce)
  * Returns: true if the context is currently pinned for use by the GPU.
  */
 static inline bool
-intel_context_is_pinned(struct intel_context *ce)
+intel_context_is_pinned(const struct intel_context *ce)
 {
 	return atomic_read(&ce->pin_count);
 }
@@ -362,7 +362,7 @@ intel_context_clear_nopreempt(struct intel_context *ce)
 	clear_bit(CONTEXT_NOPREEMPT, &ce->flags);
 }
 
-u64 intel_context_get_total_runtime_ns(const struct intel_context *ce);
+u64 intel_context_get_total_runtime_ns(struct intel_context *ce);
 u64 intel_context_get_avg_runtime_ns(struct intel_context *ce);
 
 static inline u64 intel_context_clock(void)
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 04eacae1aca5..f7ff4c7d81c7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -38,6 +38,9 @@ struct intel_context_ops {
 #define COPS_RUNTIME_CYCLES_BIT 1
 #define COPS_RUNTIME_CYCLES BIT(COPS_RUNTIME_CYCLES_BIT)
 
+#define COPS_RUNTIME_ACTIVE_TOTAL_BIT 2
+#define COPS_RUNTIME_ACTIVE_TOTAL BIT(COPS_RUNTIME_ACTIVE_TOTAL_BIT)
+
 	int (*alloc)(struct intel_context *ce);
 
 	void (*revoke)(struct intel_context *ce, struct i915_request *rq,
@@ -56,6 +59,8 @@ struct intel_context_ops {
 
 	void (*sched_disable)(struct intel_context *ce);
 
+	void (*update_stats)(struct intel_context *ce);
+
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
 
@@ -148,6 +153,7 @@ struct intel_context {
 			struct ewma_runtime avg;
 			u64 total;
 			u32 last;
+			u64 start_gt_clk;
 			I915_SELFTEST_DECLARE(u32 num_underflow);
 			I915_SELFTEST_DECLARE(u32 max_underflow);
 		} runtime;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 323b055e5db9..c7b54f1631b9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -196,6 +196,11 @@ static inline u8 guc_class_to_engine_class(u8 guc_class)
 	return guc_class_engine_class_map[guc_class];
 }
 
+/* Per context engine usage stats: */
+#define PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO	(0x500 / sizeof(u32))
+#define PPHWSP_GUC_CONTEXT_USAGE_STAMP_HI	(PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO + 1)
+#define PPHWSP_GUC_CONTEXT_USAGE_ENGINE_ID	(PPHWSP_GUC_CONTEXT_USAGE_STAMP_HI + 1)
+
 /* Work item for submitting workloads into work queue of GuC. */
 struct guc_wq_item {
 	u32 header;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0d56b615bf78..bee8cf10f749 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -378,7 +378,7 @@ static inline void set_context_guc_id_invalid(struct intel_context *ce)
 	ce->guc_id.id = GUC_INVALID_CONTEXT_ID;
 }
 
-static inline struct intel_guc *ce_to_guc(struct intel_context *ce)
+static inline struct intel_guc *ce_to_guc(const struct intel_context *ce)
 {
 	return &ce->engine->gt->uc.guc;
 }
@@ -1376,13 +1376,16 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
+static void __guc_context_update_clks(struct intel_context *ce);
 static void guc_timestamp_ping(struct work_struct *wrk)
 {
 	struct intel_guc *guc = container_of(wrk, typeof(*guc),
 					     timestamp.work.work);
 	struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
 	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_context *ce;
 	intel_wakeref_t wakeref;
+	unsigned long index;
 	int srcu, ret;
 
 	/*
@@ -1396,6 +1399,10 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
 		__update_guc_busyness_stats(guc);
 
+	/* adjust context stats for overflow */
+	xa_for_each(&guc->context_lookup, index, ce)
+		__guc_context_update_clks(ce);
+
 	intel_gt_reset_unlock(gt, srcu);
 
 	mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
@@ -1469,6 +1476,56 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
 			 guc->timestamp.ping_delay);
 }
 
+static void __guc_context_update_clks(struct intel_context *ce)
+{
+	struct intel_guc *guc = ce_to_guc(ce);
+	struct intel_gt *gt = ce->engine->gt;
+	u32 *pphwsp, last_switch, engine_id;
+	u64 start_gt_clk, active;
+	unsigned long flags;
+	ktime_t unused;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+	/*
+	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
+	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
+	 * relies on GuC and GPU for busyness calculations. Due to this, a
+	 * potential race was highlighted in an earlier review that can lead to
+	 * double accounting of busyness. While the solution to this is a wip,
+	 * busyness is still usable for platforms running GuC submission.
+	 */
+	pphwsp = ((void *)ce->lrc_reg_state) - LRC_STATE_OFFSET;
+	last_switch = READ_ONCE(pphwsp[PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO]);
+	engine_id = READ_ONCE(pphwsp[PPHWSP_GUC_CONTEXT_USAGE_ENGINE_ID]);
+
+	guc_update_pm_timestamp(guc, &unused);
+
+	if (engine_id != 0xffffffff && last_switch) {
+		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
+		__extend_last_switch(guc, &start_gt_clk, last_switch);
+		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
+		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
+		WRITE_ONCE(ce->stats.active, active);
+	} else {
+		lrc_update_runtime(ce);
+	}
+
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+}
+
+static void guc_context_update_stats(struct intel_context *ce)
+{
+	if (!intel_context_pin_if_active(ce)) {
+		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
+		WRITE_ONCE(ce->stats.active, 0);
+		return;
+	}
+
+	__guc_context_update_clks(ce);
+	intel_context_unpin(ce);
+}
+
 static inline bool
 submission_disabled(struct intel_guc *guc)
 {
@@ -2723,6 +2780,7 @@ static void guc_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
+	lrc_update_runtime(ce);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 
@@ -3344,6 +3402,7 @@ static void remove_from_context(struct i915_request *rq)
 }
 
 static const struct intel_context_ops guc_context_ops = {
+	.flags = COPS_RUNTIME_CYCLES | COPS_RUNTIME_ACTIVE_TOTAL,
 	.alloc = guc_context_alloc,
 
 	.pre_pin = guc_context_pre_pin,
@@ -3360,6 +3419,8 @@ static const struct intel_context_ops guc_context_ops = {
 
 	.sched_disable = guc_context_sched_disable,
 
+	.update_stats = guc_context_update_stats,
+
 	.reset = lrc_reset,
 	.destroy = guc_context_destroy,
 
@@ -3593,6 +3654,7 @@ static int guc_virtual_context_alloc(struct intel_context *ce)
 }
 
 static const struct intel_context_ops virtual_guc_context_ops = {
+	.flags = COPS_RUNTIME_CYCLES | COPS_RUNTIME_ACTIVE_TOTAL,
 	.alloc = guc_virtual_context_alloc,
 
 	.pre_pin = guc_virtual_context_pre_pin,
@@ -3608,6 +3670,7 @@ static const struct intel_context_ops virtual_guc_context_ops = {
 	.exit = guc_virtual_context_exit,
 
 	.sched_disable = guc_context_sched_disable,
+	.update_stats = guc_context_update_stats,
 
 	.destroy = guc_context_destroy,
 
diff --git a/drivers/gpu/drm/i915/i915_drm_client.c b/drivers/gpu/drm/i915/i915_drm_client.c
index b09d1d386574..8d81119fff14 100644
--- a/drivers/gpu/drm/i915/i915_drm_client.c
+++ b/drivers/gpu/drm/i915/i915_drm_client.c
@@ -147,11 +147,7 @@ void i915_drm_client_fdinfo(struct seq_file *m, struct file *f)
 		   PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
 	seq_printf(m, "drm-client-id:\t%u\n", client->id);
 
-	/*
-	 * Temporarily skip showing client engine information with GuC submission till
-	 * fetching engine busyness is implemented in the GuC submission backend
-	 */
-	if (GRAPHICS_VER(i915) < 8 || intel_uc_uses_guc_submission(&i915->gt0.uc))
+	if (GRAPHICS_VER(i915) < 8)
 		return;
 
 	for (i = 0; i < ARRAY_SIZE(uabi_class_names); i++)
-- 
2.34.1



* [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-08-31 19:33 [Intel-gfx] [PATCH 1/2] i915/pmu: Wire GuC backend to per-client busyness Ashutosh Dixit
@ 2022-08-31 19:33 ` Ashutosh Dixit
  2022-08-31 22:45   ` Umesh Nerlige Ramappa
  2022-08-31 19:50 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/2] i915/pmu: Wire GuC backend to per-client busyness Patchwork
  1 sibling, 1 reply; 10+ messages in thread
From: Ashutosh Dixit @ 2022-08-31 19:33 UTC (permalink / raw)
  To: intel-gfx

1. Do all ce->stats updates and reads under guc->timestamp.lock
2. Pin context image before reading
3. Merge __guc_context_update_clks and guc_context_update_stats into a
   single function
4. Call lrc_update_runtime() unconditionally in guc_context_update_stats
5. Seems no need to update ce->stats.active with this approach

Some of the above steps may not be correct or complete but the idea is to
discuss/improve the original patch.

Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c       |  2 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 ++++++++++---------
 3 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e2d70a9fdac0..c004f676de27 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -581,7 +581,7 @@ u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
 	u64 total, active;
 
 	if (ce->ops->update_stats)
-		ce->ops->update_stats(ce);
+		return ce->ops->update_stats(ce);
 
 	total = ce->stats.runtime.total;
 	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index f7ff4c7d81c7..699435bfff99 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -59,7 +59,7 @@ struct intel_context_ops {
 
 	void (*sched_disable)(struct intel_context *ce);
 
-	void (*update_stats)(struct intel_context *ce);
+	u64 (*update_stats)(struct intel_context *ce);
 
 	void (*reset)(struct intel_context *ce);
 	void (*destroy)(struct kref *kref);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index bee8cf10f749..40d0edaf2e0a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1376,7 +1376,6 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
-static void __guc_context_update_clks(struct intel_context *ce);
 static void guc_timestamp_ping(struct work_struct *wrk)
 {
 	struct intel_guc *guc = container_of(wrk, typeof(*guc),
@@ -1401,7 +1400,8 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 
 	/* adjust context stats for overflow */
 	xa_for_each(&guc->context_lookup, index, ce)
-		__guc_context_update_clks(ce);
+		if (ce->ops->update_stats)
+			ce->ops->update_stats(ce);
 
 	intel_gt_reset_unlock(gt, srcu);
 
@@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
 			 guc->timestamp.ping_delay);
 }
 
-static void __guc_context_update_clks(struct intel_context *ce)
+static u64 guc_context_update_stats(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	struct intel_gt *gt = ce->engine->gt;
 	u32 *pphwsp, last_switch, engine_id;
-	u64 start_gt_clk, active;
 	unsigned long flags;
+	u64 total, active = 0;
 	ktime_t unused;
 
+	intel_context_pin(ce);
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
+	lrc_update_runtime(ce);
+	total = ce->stats.runtime.total;
+
 	/*
 	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
 	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
@@ -1502,27 +1506,26 @@ static void __guc_context_update_clks(struct intel_context *ce)
 	guc_update_pm_timestamp(guc, &unused);
 
 	if (engine_id != 0xffffffff && last_switch) {
-		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
-		__extend_last_switch(guc, &start_gt_clk, last_switch);
-		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
-		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
-		WRITE_ONCE(ce->stats.active, active);
-	} else {
-		lrc_update_runtime(ce);
+		__extend_last_switch(guc, &ce->stats.runtime.start_gt_clk, last_switch);
+		active = intel_gt_clock_interval_to_ns(gt,
+			      guc->timestamp.gt_stamp - ce->stats.runtime.start_gt_clk);
 	}
 
 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+	intel_context_unpin(ce);
+
+	return total + active;
 }
 
-static void guc_context_update_stats(struct intel_context *ce)
+void lrc_update_runtime_locked(struct intel_context *ce)
 {
-	if (!intel_context_pin_if_active(ce)) {
-		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
-		WRITE_ONCE(ce->stats.active, 0);
-		return;
-	}
+	struct intel_guc *guc = ce_to_guc(ce);
+	unsigned long flags;
 
-	__guc_context_update_clks(ce);
+	intel_context_pin(ce);
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+	lrc_update_runtime(ce);
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 	intel_context_unpin(ce);
 }
 
@@ -2780,7 +2783,7 @@ static void guc_context_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 
-	lrc_update_runtime(ce);
+	lrc_update_runtime_locked(ce);
 	unpin_guc_id(guc, ce);
 	lrc_unpin(ce);
 
-- 
2.34.1



* [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/2] i915/pmu: Wire GuC backend to per-client busyness
  2022-08-31 19:33 [Intel-gfx] [PATCH 1/2] i915/pmu: Wire GuC backend to per-client busyness Ashutosh Dixit
  2022-08-31 19:33 ` [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking Ashutosh Dixit
@ 2022-08-31 19:50 ` Patchwork
  1 sibling, 0 replies; 10+ messages in thread
From: Patchwork @ 2022-08-31 19:50 UTC (permalink / raw)
  To: Ashutosh Dixit; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] i915/pmu: Wire GuC backend to per-client busyness
URL   : https://patchwork.freedesktop.org/series/107983/
State : failure

== Summary ==

Error: make failed
  CALL    scripts/checksyscalls.sh
  CALL    scripts/atomic/check-atomics.sh
  DESCEND objtool
  CHK     include/generated/compile.h
  CC [M]  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.o
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:1520:6: error: no previous prototype for ‘lrc_update_runtime_locked’ [-Werror=missing-prototypes]
 void lrc_update_runtime_locked(struct intel_context *ce)
      ^~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
scripts/Makefile.build:249: recipe for target 'drivers/gpu/drm/i915/gt/uc/intel_guc_submission.o' failed
make[4]: *** [drivers/gpu/drm/i915/gt/uc/intel_guc_submission.o] Error 1
scripts/Makefile.build:465: recipe for target 'drivers/gpu/drm/i915' failed
make[3]: *** [drivers/gpu/drm/i915] Error 2
scripts/Makefile.build:465: recipe for target 'drivers/gpu/drm' failed
make[2]: *** [drivers/gpu/drm] Error 2
scripts/Makefile.build:465: recipe for target 'drivers/gpu' failed
make[1]: *** [drivers/gpu] Error 2
Makefile:1853: recipe for target 'drivers' failed
make: *** [drivers] Error 2




* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-08-31 19:33 ` [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking Ashutosh Dixit
@ 2022-08-31 22:45   ` Umesh Nerlige Ramappa
  2022-09-01 23:55     ` Dixit, Ashutosh
  0 siblings, 1 reply; 10+ messages in thread
From: Umesh Nerlige Ramappa @ 2022-08-31 22:45 UTC (permalink / raw)
  To: Ashutosh Dixit; +Cc: intel-gfx

On Wed, Aug 31, 2022 at 12:33:55PM -0700, Ashutosh Dixit wrote:
>1. Do all ce->stats updates and reads under guc->timestamp.lock

Other than the question about READ_ONCE/WRITE_ONCE, I am not sure I 
understand what's broken in the locking logic here. Can you please add 
some description? thanks

>2. Pin context image before reading
>3. Merge __guc_context_update_clks and guc_context_update_stats into a
>   single function
>4. Call lrc_update_runtime() unconditionally in guc_context_update_stats
>5. Seems no need to update ce->stats.active with this approach
>
>Some of the above steps may not be correct or complete but the idea is to
>discuss/improve the original patch.
>
>Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
>---
> drivers/gpu/drm/i915/gt/intel_context.c       |  2 +-
> drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +-
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 ++++++++++---------
> 3 files changed, 24 insertions(+), 21 deletions(-)
>
>diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>index e2d70a9fdac0..c004f676de27 100644
>--- a/drivers/gpu/drm/i915/gt/intel_context.c
>+++ b/drivers/gpu/drm/i915/gt/intel_context.c
>@@ -581,7 +581,7 @@ u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
> 	u64 total, active;
>
> 	if (ce->ops->update_stats)
>-		ce->ops->update_stats(ce);
>+		return ce->ops->update_stats(ce);

This is an improvement that we can add and eventually, we can make this 
a GuC specific vfunc. Doing so may also remove the 
COPS_RUNTIME_ACTIVE_TOTAL option that I added to GuC specific context 
ops.

>
> 	total = ce->stats.runtime.total;
> 	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
>diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>index f7ff4c7d81c7..699435bfff99 100644
>--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>@@ -59,7 +59,7 @@ struct intel_context_ops {
>
> 	void (*sched_disable)(struct intel_context *ce);
>
>-	void (*update_stats)(struct intel_context *ce);
>+	u64 (*update_stats)(struct intel_context *ce);
>
> 	void (*reset)(struct intel_context *ce);
> 	void (*destroy)(struct kref *kref);
>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>index bee8cf10f749..40d0edaf2e0a 100644
>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>@@ -1376,7 +1376,6 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
> 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> }
>
>-static void __guc_context_update_clks(struct intel_context *ce);
> static void guc_timestamp_ping(struct work_struct *wrk)
> {
> 	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>@@ -1401,7 +1400,8 @@ static void guc_timestamp_ping(struct work_struct *wrk)
>
> 	/* adjust context stats for overflow */
> 	xa_for_each(&guc->context_lookup, index, ce)
>-		__guc_context_update_clks(ce);
>+		if (ce->ops->update_stats)
>+			ce->ops->update_stats(ce);

context pinning logic needs to be separated for this (ping) path vs. the 
query path - intel_context_get_total_runtime_ns().
>
> 	intel_gt_reset_unlock(gt, srcu);
>
>@@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
> 			 guc->timestamp.ping_delay);
> }
>
>-static void __guc_context_update_clks(struct intel_context *ce)
>+static u64 guc_context_update_stats(struct intel_context *ce)
> {
> 	struct intel_guc *guc = ce_to_guc(ce);
> 	struct intel_gt *gt = ce->engine->gt;
> 	u32 *pphwsp, last_switch, engine_id;
>-	u64 start_gt_clk, active;
> 	unsigned long flags;
>+	u64 total, active = 0;
> 	ktime_t unused;
>
>+	intel_context_pin(ce);

intel_context_pin can sleep, and we are not allowed to sleep in this path -
intel_context_get_total_runtime_ns(). However, we can sleep in the ping
worker path, so ideally we want to separate it out for the 2 paths.

> 	spin_lock_irqsave(&guc->timestamp.lock, flags);
>
>+	lrc_update_runtime(ce);
>+	total = ce->stats.runtime.total;

For long-running contexts and frequent queries, calling this ^ when a 
context is active does not add value. I would just call it in the else 
like before.

>+
> 	/*
> 	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
> 	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
>@@ -1502,27 +1506,26 @@ static void __guc_context_update_clks(struct intel_context *ce)
> 	guc_update_pm_timestamp(guc, &unused);
>
> 	if (engine_id != 0xffffffff && last_switch) {
>-		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
>-		__extend_last_switch(guc, &start_gt_clk, last_switch);
>-		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
>-		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
>-		WRITE_ONCE(ce->stats.active, active);
>-	} else {
>-		lrc_update_runtime(ce);
>+		__extend_last_switch(guc, &ce->stats.runtime.start_gt_clk, last_switch);
>+		active = intel_gt_clock_interval_to_ns(gt,
>+			      guc->timestamp.gt_stamp - ce->stats.runtime.start_gt_clk);

Makes sense. Earlier it was a rmw, now it's just a write to 
ce->stats.runtime.start_gt_clk.

> 	}
>
> 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>+	intel_context_unpin(ce);
>+
>+	return total + active;

return ce->stats.runtime.total + active;

and then drop the local copy of total by bringing back the else{}.

> }
>
>-static void guc_context_update_stats(struct intel_context *ce)
>+void lrc_update_runtime_locked(struct intel_context *ce)
> {
>-	if (!intel_context_pin_if_active(ce)) {
>-		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
>-		WRITE_ONCE(ce->stats.active, 0);
>-		return;
>-	}
>+	struct intel_guc *guc = ce_to_guc(ce);
>+	unsigned long flags;
>
>-	__guc_context_update_clks(ce);
>+	intel_context_pin(ce);
>+	spin_lock_irqsave(&guc->timestamp.lock, flags);
>+	lrc_update_runtime(ce);
>+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> 	intel_context_unpin(ce);

From where lrc_update_runtime_locked is being called, I would assume 
that the context is already pinned (until unpin_guc_id is called).

Thanks,
Umesh

> }
>
>@@ -2780,7 +2783,7 @@ static void guc_context_unpin(struct intel_context *ce)
> {
> 	struct intel_guc *guc = ce_to_guc(ce);
>
>-	lrc_update_runtime(ce);
>+	lrc_update_runtime_locked(ce);
> 	unpin_guc_id(guc, ce);
> 	lrc_unpin(ce);
>
>-- 
>2.34.1
>


* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-08-31 22:45   ` Umesh Nerlige Ramappa
@ 2022-09-01 23:55     ` Dixit, Ashutosh
  2022-09-06 18:29       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 10+ messages in thread
From: Dixit, Ashutosh @ 2022-09-01 23:55 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:
>

Hi Umesh,

I have updated my RFC patch based on your feedback so we can discuss again.

> On Wed, Aug 31, 2022 at 12:33:55PM -0700, Ashutosh Dixit wrote:
> > 1. Do all ce->stats updates and reads under guc->timestamp.lock
>
> Other than the question about READ_ONCE/WRITE_ONCE, I am not sure I
> understand what's broken in the locking logic here. Can you please add some
> description? thanks

Basically ce->stats.runtime.total and ce->stats.active are tied together so
should be accessed/updated atomically. Also just the update of
ce->stats.runtime.total (via lrc_update_runtime()) can have multiple
concurrent writers so even that has to be protected (since that update is
not atomic).

These were the reasons for:
* Introducing lrc_update_runtime_locked
* Returning early from intel_context_get_total_runtime_ns otherwise we'll
  need to introduce the same locking there
* Enforcing locking in guc_context_update_stats (which was there in your
  original patch too)

(I think execlists code didn't have this issue because there
lrc_update_runtime is called from a tasklet (which iirc is like a single
thread/cpu). I am not sure how 'active' is updated in execlist mode so
there may or may not be a problem in intel_context_get_total_runtime_ns).

I have another question: what about that GPU vs. GuC race mentioned in the
comment? What is the sequence of events there between GPU updating the
timestamp in the context image vs. GuC setting engine_id to -1 (at least
what is the sequence in which GPU and GuC supposed to do these updates
though it may not matter to i915 due to scheduling delays)?

>
> > 2. Pin context image before reading
> > 3. Merge __guc_context_update_clks and guc_context_update_stats into a
> >   single function
> > 4. Call lrc_update_runtime() unconditionally in guc_context_update_stats
> > 5. Seems no need to update ce->stats.active with this approach
> >
> > Some of the above steps may not be correct or complete but the idea is to
> > discuss/improve the original patch.
> >
> > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
> > ---
> > drivers/gpu/drm/i915/gt/intel_context.c       |  2 +-
> > drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +-
> > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 ++++++++++---------
> > 3 files changed, 24 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index e2d70a9fdac0..c004f676de27 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -581,7 +581,7 @@ u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
> >	u64 total, active;
> >
> >	if (ce->ops->update_stats)
> > -		ce->ops->update_stats(ce);
> > +		return ce->ops->update_stats(ce);
>
> This is an improvement that we can add and eventually, we can make this a
> GuC specific vfunc. Doing so may also remove the COPS_RUNTIME_ACTIVE_TOTAL
> option that I added to GuC specific context ops.

I went ahead and did this in v2 of the RFC patch.

> >
> >	total = ce->stats.runtime.total;
> >	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > index f7ff4c7d81c7..699435bfff99 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > @@ -59,7 +59,7 @@ struct intel_context_ops {
> >
> >	void (*sched_disable)(struct intel_context *ce);
> >
> > -	void (*update_stats)(struct intel_context *ce);
> > +	u64 (*update_stats)(struct intel_context *ce);
> >
> >	void (*reset)(struct intel_context *ce);
> >	void (*destroy)(struct kref *kref);
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index bee8cf10f749..40d0edaf2e0a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1376,7 +1376,6 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > }
> >
> > -static void __guc_context_update_clks(struct intel_context *ce);
> > static void guc_timestamp_ping(struct work_struct *wrk)
> > {
> >	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> > @@ -1401,7 +1400,8 @@ static void guc_timestamp_ping(struct work_struct *wrk)
> >
> >	/* adjust context stats for overflow */
> >	xa_for_each(&guc->context_lookup, index, ce)
> > -		__guc_context_update_clks(ce);
> > +		if (ce->ops->update_stats)
> > +			ce->ops->update_stats(ce);
>
> context pinning logic needs to be separated for this (ping) path vs. the
> query path - intel_context_get_total_runtime_ns().

Done in v2 of RFC patch.

> >
> >	intel_gt_reset_unlock(gt, srcu);
> >
> > @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
> >			 guc->timestamp.ping_delay);
> > }
> >
> > -static void __guc_context_update_clks(struct intel_context *ce)
> > +static u64 guc_context_update_stats(struct intel_context *ce)
> > {
> >	struct intel_guc *guc = ce_to_guc(ce);
> >	struct intel_gt *gt = ce->engine->gt;
> >	u32 *pphwsp, last_switch, engine_id;
> > -	u64 start_gt_clk, active;
> >	unsigned long flags;
> > +	u64 total, active = 0;
> >	ktime_t unused;
> >
> > +	intel_context_pin(ce);
>
> intel_context_pin can sleep and we are not allowed to sleep in this path -
> intel_context_get_total_runtime_ns(), however we can sleep in the ping
> worker path, so ideally we want to separate it out for the 2 paths.

Do we know which intel_context_get_total_runtime_ns() call is not allowed
to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
sleep. As mentioned, I have done this in v2 of the RFC patch, which I think
is sufficient, but a more complicated scheme (which I think we can avoid
for now) would be to pin only in code paths where sleeping is allowed.

> >	spin_lock_irqsave(&guc->timestamp.lock, flags);
> >
> > +	lrc_update_runtime(ce);
> > +	total = ce->stats.runtime.total;
>
> That would get called every second (default intel_gpu_top query internal)
> for a long running workload. multiply that with all active contexts.
>
> For long running contexts and frequenct queries, calling this ^ when a
> context is active does not add value. I would just call it in the else like
> before.

I have done this in v2, but are we certain that when a context is active
its ce->stats.runtime.total has been updated? Is it updated on each
context save or only when scheduling is disabled (which would not happen if
the context is active)? That was the reason for my keeping it out of the
else.

> > +
> >	/*
> >	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
> >	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
> > @@ -1502,27 +1506,26 @@ static void __guc_context_update_clks(struct intel_context *ce)
> >	guc_update_pm_timestamp(guc, &unused);
> >
> >	if (engine_id != 0xffffffff && last_switch) {
> > -		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
> > -		__extend_last_switch(guc, &start_gt_clk, last_switch);
> > -		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
> > -		WRITE_ONCE(ce->stats.active, active);
> > -	} else {
> > -		lrc_update_runtime(ce);
> > +		__extend_last_switch(guc, &ce->stats.runtime.start_gt_clk, last_switch);
> > +		active = intel_gt_clock_interval_to_ns(gt,
> > +			      guc->timestamp.gt_stamp - ce->stats.runtime.start_gt_clk);
>
> Makes sense. Earlier it was a rmw, now it's just a write to
> ce->stats.runtime.start_gt_clk.

It is still an rmw, just not as explicitly visible in the code as was the
case earlier.

> >	}
> >
> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> > +	intel_context_unpin(ce);
> > +
> > +	return total + active;
>
> return ce->stats.runtime.total + active;
>
> and then drop the local copy of total by bringing back the else{}.

Some changes here in v2: ce->stats.active had to be revived since we need
to return the previously computed value when we cannot pin.

> > }
> >
> > -static void guc_context_update_stats(struct intel_context *ce)
> > +void lrc_update_runtime_locked(struct intel_context *ce)
> > {
> > -	if (!intel_context_pin_if_active(ce)) {
> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
> > -		WRITE_ONCE(ce->stats.active, 0);
> > -		return;
> > -	}
> > +	struct intel_guc *guc = ce_to_guc(ce);
> > +	unsigned long flags;
> >
> > -	__guc_context_update_clks(ce);
> > +	intel_context_pin(ce);
> > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> > +	lrc_update_runtime(ce);
> > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> >	intel_context_unpin(ce);
>
> From where lrc_update_runtime_locked is being called, I would assume that
> the context is already pinned (until unpin_guc_id is called).

Fixed in v2.

From the other email:

> From your rfc patch, I like
> - the idea of not touching ce->stats.active
> - having the update_stats return u64
> - not doing a rmw for start_gt_clk
>
> With those changes, we are only accessing total in ce->stats, so we don't
> really need a lrc_update_runtime_locked.

Strictly speaking (as explained above), because the computation of
ce->stats.runtime.total in lrc_update_runtime is not atomic and there are
potentially multiple concurrent callers of lrc_update_runtime,
lrc_update_runtime_locked is needed and I have retained it in v2.

When you say we don't need it, you probably mean that because we are
dealing with something like busyness, we can tolerate occasional errors in
busyness if we don't have locking (and we can't verify busyness values
anyway though I think they are expected to be monotonically
increasing). But my question is why not keep it simple and retain the
locking?

Thanks.
--
Ashutosh

> > }
> >
> > @@ -2780,7 +2783,7 @@ static void guc_context_unpin(struct intel_context *ce)
> > {
> >	struct intel_guc *guc = ce_to_guc(ce);
> >
> > -	lrc_update_runtime(ce);
> > +	lrc_update_runtime_locked(ce);
> >	unpin_guc_id(guc, ce);
> >	lrc_unpin(ce);
> >
> > --
> > 2.34.1
> >


* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-09-01 23:55     ` Dixit, Ashutosh
@ 2022-09-06 18:29       ` Umesh Nerlige Ramappa
  2022-09-07  7:26         ` Dixit, Ashutosh
  2022-09-07  7:28         ` Tvrtko Ursulin
  0 siblings, 2 replies; 10+ messages in thread
From: Umesh Nerlige Ramappa @ 2022-09-06 18:29 UTC (permalink / raw)
  To: Dixit, Ashutosh; +Cc: intel-gfx

On Thu, Sep 01, 2022 at 04:55:22PM -0700, Dixit, Ashutosh wrote:
>On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:
>>
>
>Hi Umesh,
>
>I have updated my RFC patch based on your feedback so we can discuss again.
>
>> On Wed, Aug 31, 2022 at 12:33:55PM -0700, Ashutosh Dixit wrote:
>> > 1. Do all ce->stats updates and reads under guc->timestamp.lock
>>
>> Other than the question about READ_ONCE/WRITE_ONCE, I am not sure I
>> understand what's broken in the locking logic here. Can you please add some
>> description? thanks
>
>Basically ce->stats.runtime.total and ce->stats.active are tied together so
>should be accessed/updated atomically. Also just the update of
>ce->stats.runtime.total (via lrc_update_runtime()) can have multiple
>concurrent writers so even that has to be protected (since that update is
>not atomic).
>
>These was the reason for:
>* Introducing lrc_update_runtime_locked
>* Returning early from intel_context_get_total_runtime_ns otherwise we'll
>  need to introduce the same locking there
>* Enforcing locking in guc_context_update_stats (which was there in your
>  original patch too)
>
>(I think execlists code didn't have this issue because there
>lrc_update_runtime is called from a tasklet (which iirc is like a single
>thread/cpu). I am not sure how 'active' is updated in execlist mode so
>there may or may not be a problem in intel_context_get_total_runtime_ns).
>
>I have another question: what about that GPU vs. GuC race mentioned in the
>comment? What is the sequence of events there between GPU updating the
>timestamp in the context image vs. GuC setting engine_id to -1 (at least
>what is the sequence in which GPU and GuC supposed to do these updates
>though it may not matter to i915 due to scheduling delays)?

With GPU, KMD and GuC running concurrently, after a context is done, 
this is the sequence.

GPU) updates context image (total_runtime)
KMD) reads total_runtime.
KMD) sees engine_id is valid. KMD uses start_time from pphwsp to 
calculate active_time.
KMD) returns total_runtime + active_time (double accounted busyness!!)
GuC) gets ctxt switchout event and sets engine_id and start_time to ~0

This is not rare when running the IGT tests, so the total runtime is 
updated in the query only if engine_id is idle. Continued below ...
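That guard could be sketched, in a simplified userspace model (the struct, field names, and helper below are illustrative stand-ins, not the actual i915 structures or code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the KMD-side query. */
struct ctx_stats {
	uint64_t total;     /* accumulated ticks from the context image */
	uint64_t start;     /* last switch-in time from the pphwsp */
	uint32_t engine_id; /* ~0u once GuC has switched the context out */
};

#define ENGINE_ID_INVALID (~0u)

/*
 * Only fold in active time while the context is running; only refresh
 * the saved total while it is idle.  This closes the window where the
 * GPU has already written a new total into the context image but GuC
 * has not yet invalidated engine_id, which would double-account the
 * last slice of runtime.
 */
static uint64_t query_busyness(struct ctx_stats *s, uint64_t now,
			       uint64_t latest_total)
{
	if (s->engine_id != ENGINE_ID_INVALID)
		return s->total + (now - s->start);

	s->total = latest_total;	/* lrc_update_runtime() equivalent */
	return s->total;
}
```

The point being that the active and idle paths each consume exactly one of the two time sources, never both for the same interval.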

>
>>
>> > 2. Pin context image before reading
>> > 3. Merge __guc_context_update_clks and guc_context_update_stats into a
>> >   single function
>> > 4. Call lrc_update_runtime() unconditionally in guc_context_update_stats
>> > 5. Seems no need to update ce->stats.active with this approach
>> >
>> > Some of the above steps may not be correct or complete but the idea is to
>> > discuss/improve the original patch.
>> >
>> > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> > Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
>> > ---
>> > drivers/gpu/drm/i915/gt/intel_context.c       |  2 +-
>> > drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +-
>> > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 ++++++++++---------
>> > 3 files changed, 24 insertions(+), 21 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>> > index e2d70a9fdac0..c004f676de27 100644
>> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
>> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>> > @@ -581,7 +581,7 @@ u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
>> >	u64 total, active;
>> >
>> >	if (ce->ops->update_stats)
>> > -		ce->ops->update_stats(ce);
>> > +		return ce->ops->update_stats(ce);
>>
>> This is an improvement that we can add and eventually, we can make this a
>> GuC specific vfunc. Doing so may also remove the COPS_RUNTIME_ACTIVE_TOTAL
>> option that I added to GuC specific context ops.
>
>I went ahead and did this in v2 of the RFC patch.
>
>> >
>> >	total = ce->stats.runtime.total;
>> >	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
>> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
>> > index f7ff4c7d81c7..699435bfff99 100644
>> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>> > @@ -59,7 +59,7 @@ struct intel_context_ops {
>> >
>> >	void (*sched_disable)(struct intel_context *ce);
>> >
>> > -	void (*update_stats)(struct intel_context *ce);
>> > +	u64 (*update_stats)(struct intel_context *ce);
>> >
>> >	void (*reset)(struct intel_context *ce);
>> >	void (*destroy)(struct kref *kref);
>> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> > index bee8cf10f749..40d0edaf2e0a 100644
>> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> > @@ -1376,7 +1376,6 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
>> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> > }
>> >
>> > -static void __guc_context_update_clks(struct intel_context *ce);
>> > static void guc_timestamp_ping(struct work_struct *wrk)
>> > {
>> >	struct intel_guc *guc = container_of(wrk, typeof(*guc),
>> > @@ -1401,7 +1400,8 @@ static void guc_timestamp_ping(struct work_struct *wrk)
>> >
>> >	/* adjust context stats for overflow */
>> >	xa_for_each(&guc->context_lookup, index, ce)
>> > -		__guc_context_update_clks(ce);
>> > +		if (ce->ops->update_stats)
>> > +			ce->ops->update_stats(ce);
>>
>> context pinning logic needs to be separated for this (ping) path vs. the
>> query path - intel_context_get_total_runtime_ns().
>
>Done in v2 of RFC patch.
>
>> >
>> >	intel_gt_reset_unlock(gt, srcu);
>> >
>> > @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
>> >			 guc->timestamp.ping_delay);
>> > }
>> >
>> > -static void __guc_context_update_clks(struct intel_context *ce)
>> > +static u64 guc_context_update_stats(struct intel_context *ce)
>> > {
>> >	struct intel_guc *guc = ce_to_guc(ce);
>> >	struct intel_gt *gt = ce->engine->gt;
>> >	u32 *pphwsp, last_switch, engine_id;
>> > -	u64 start_gt_clk, active;
>> >	unsigned long flags;
>> > +	u64 total, active = 0;
>> >	ktime_t unused;
>> >
>> > +	intel_context_pin(ce);
>>
>> intel_context_pin can sleep and we are not allowed to sleep in this path -
>> intel_context_get_total_runtime_ns(), however we can sleep in the ping
>> worker path, so ideally we want to separate it out for the 2 paths.
>
>Do we know which intel_context_get_total_runtime_ns() call is not allowed
>to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
>sleep. As mentioned I have done this is v2 of RFC patch which I think is
>sufficient, but a more complicated scheme (which I think we can avoid for
>now) would be to pin in code paths when sleeping is allowed.
>

Hmm, maybe intel_context_get_total_runtime_ns can sleep, not sure why I 
was assuming that this falls in the perf_pmu path. This is now in the 
drm_fdinfo query path. + Tvrtko.

@Tvrtko, any idea if intel_context_get_total_runtime_ns path can sleep?

>> >	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> >
>> > +	lrc_update_runtime(ce);
>> > +	total = ce->stats.runtime.total;
>>
>> That would get called every second (default intel_gpu_top query internal)
>> for a long running workload. multiply that with all active contexts.
>>
>> For long running contexts and frequenct queries, calling this ^ when a
>> context is active does not add value. I would just call it in the else like
>> before.
>
>I have done this in v2 but are we certain that when a context is active
>it's ce->stats.runtime.total has been updated? Is it updated on each
>context save or only when scheduling is disabled (which would not happen if
>the context is active). That was the reason for my keeping it out of the
>else.

continued here ...

You are right, it's not updated until scheduling is disabled or a query 
falls in the else block. We can run into a case where 
stats.runtime.total has not yet been updated because the context is 
switching in/out too frequently and the query always lands in the active 
block. In that case busyness values will accumulate the time between 
switch-out and switch-in and eventually drop when ce->stats.runtime.total 
is updated. This is an issue we need to resolve.

I think we should track the previous value of total_runtime. On a query, 
if total_runtime has changed, we should just return that (irrespective 
of whether the engine is active or not). Active time should be part of 
the busyness calculations only if total_runtime has not changed 
(compared to the previous sample).
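One way to sketch that proposal (again as an illustrative userspace model with hypothetical names, not actual driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Remember the total reported by the previous sample; add active time
 * only when the saved total has not moved since then, so a freshly
 * saved total is never combined with the active slice it already
 * contains. */
struct busy_sample {
	uint64_t prev_total;	/* total seen at the previous query */
};

static uint64_t sample_busyness(struct busy_sample *bs, uint64_t total,
				uint64_t active)
{
	uint64_t busy;

	if (total != bs->prev_total)
		busy = total;		/* total advanced: trust it alone */
	else
		busy = total + active;	/* total stale: add active time */

	bs->prev_total = total;
	return busy;
}
```

As noted, this keeps reported busyness from dropping, at the cost of lagging by one sample interval when the total does advance.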

>
>> > +
>> >	/*
>> >	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
>> >	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
>> > @@ -1502,27 +1506,26 @@ static void __guc_context_update_clks(struct intel_context *ce)
>> >	guc_update_pm_timestamp(guc, &unused);
>> >
>> >	if (engine_id != 0xffffffff && last_switch) {
>> > -		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
>> > -		__extend_last_switch(guc, &start_gt_clk, last_switch);
>> > -		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
>> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
>> > -		WRITE_ONCE(ce->stats.active, active);
>> > -	} else {
>> > -		lrc_update_runtime(ce);
>> > +		__extend_last_switch(guc, &ce->stats.runtime.start_gt_clk, last_switch);
>> > +		active = intel_gt_clock_interval_to_ns(gt,
>> > +			      guc->timestamp.gt_stamp - ce->stats.runtime.start_gt_clk);
>>
>> Makes sense. Earlier it was a rmw, now it's just a write to
>> ce->stats.runtime.start_gt_clk.
>
>It is still a rmw, just that it is not as explicitly seen in the code as
>was the case earlier.
>
>> >	}
>> >
>> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> > +	intel_context_unpin(ce);
>> > +
>> > +	return total + active;
>>
>> return ce->stats.runtime.total + active;
>>
>> and then drop the local copy of total by bringing back the else{}.
>
>Some changes here in v2, ce->stats.active had to be revived since we need
>to return previously computed value when we cannot pin.

Let's wait for Tvrtko's inputs. If this can sleep, I'd just use the 
original changes you had.

>
>> > }
>> >
>> > -static void guc_context_update_stats(struct intel_context *ce)
>> > +void lrc_update_runtime_locked(struct intel_context *ce)
>> > {
>> > -	if (!intel_context_pin_if_active(ce)) {
>> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
>> > -		WRITE_ONCE(ce->stats.active, 0);
>> > -		return;
>> > -	}
>> > +	struct intel_guc *guc = ce_to_guc(ce);
>> > +	unsigned long flags;
>> >
>> > -	__guc_context_update_clks(ce);
>> > +	intel_context_pin(ce);
>> > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> > +	lrc_update_runtime(ce);
>> > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> >	intel_context_unpin(ce);
>>
>> From where lrc_update_runtime_locked is being called, I would assume that
>> the context is already pinned (until unpin_guc_id is called).
>
>Fixed in v2.
>
>From the other email:
>
>> From your rfc patch, I like
>> - the idea of not touching ce->stats.active
>> - having the update_stats return u64
>> - not doing a rmw for start_gt_clk
>>
>> With those changes, we are only accessing total in ce->stats, so we don't
>> really need a lrc_update_runtime_locked.
>
>Strictly speaking, (as explained above) because the computation of
>ce->stats.runtime.total in lrc_update_runtime is not atomic and there are
>potentially multiple concurrent callers of lrc_update_runtime,
>lrc_update_runtime_locked is needed and I have retained it in v2.
>
>When you say we don't need it, you probably mean that because we are
>dealing with something like busyness, we can tolerate occasional errors in
>busyness if we don't have locking (and we can't verify busyness values
>anyway though I think they are expected to be monotonically
>increasing). But my question is why not keep it simple and retain the
>locking?

The locking you suggested earlier is fine. I think I assumed the 
ce->stats.runtime.total is being updated atomically, but I see your 
point.

Thanks,
Umesh

>
>Thanks.
>--
>Ashutosh
>
>> > }
>> >
>> > @@ -2780,7 +2783,7 @@ static void guc_context_unpin(struct intel_context *ce)
>> > {
>> >	struct intel_guc *guc = ce_to_guc(ce);
>> >
>> > -	lrc_update_runtime(ce);
>> > +	lrc_update_runtime_locked(ce);
>> >	unpin_guc_id(guc, ce);
>> >	lrc_unpin(ce);
>> >
>> > --
>> > 2.34.1
>> >


* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-09-06 18:29       ` Umesh Nerlige Ramappa
@ 2022-09-07  7:26         ` Dixit, Ashutosh
  2022-09-07  7:28         ` Tvrtko Ursulin
  1 sibling, 0 replies; 10+ messages in thread
From: Dixit, Ashutosh @ 2022-09-07  7:26 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

On Tue, 06 Sep 2022 11:29:15 -0700, Umesh Nerlige Ramappa wrote:
>
> On Thu, Sep 01, 2022 at 04:55:22PM -0700, Dixit, Ashutosh wrote:
> > On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:
> >>
> >
> > Hi Umesh,
> >
> > I have updated my RFC patch based on your feedback so we can discuss again.
> >
> >> On Wed, Aug 31, 2022 at 12:33:55PM -0700, Ashutosh Dixit wrote:
> >> > 1. Do all ce->stats updates and reads under guc->timestamp.lock
> >>
> >> Other than the question about READ_ONCE/WRITE_ONCE, I am not sure I
> >> understand what's broken in the locking logic here. Can you please add some
> >> description? thanks
> >
> > Basically ce->stats.runtime.total and ce->stats.active are tied together so
> > should be accessed/updated atomically. Also just the update of
> > ce->stats.runtime.total (via lrc_update_runtime()) can have multiple
> > concurrent writers so even that has to be protected (since that update is
> > not atomic).
> >
> > These was the reason for:
> > * Introducing lrc_update_runtime_locked
> > * Returning early from intel_context_get_total_runtime_ns otherwise we'll
> >  need to introduce the same locking there
> > * Enforcing locking in guc_context_update_stats (which was there in your
> >  original patch too)
> >
> > (I think execlists code didn't have this issue because there
> > lrc_update_runtime is called from a tasklet (which iirc is like a single
> > thread/cpu). I am not sure how 'active' is updated in execlist mode so
> > there may or may not be a problem in intel_context_get_total_runtime_ns).
> >
> > I have another question: what about that GPU vs. GuC race mentioned in the
> > comment? What is the sequence of events there between GPU updating the
> > timestamp in the context image vs. GuC setting engine_id to -1 (at least
> > what is the sequence in which GPU and GuC supposed to do these updates
> > though it may not matter to i915 due to scheduling delays)?
>
> With GPU, KMD and GuC running concurrently, after a context is done, this
> is the sequence.
>
> GPU) updates context image (total_runtime)
> KMD) reads total_runtime.
> KMD) sees engine_id is valid. KMD uses start_time from pphwsp to calculate
> active_time.
> KMD) returns total_runtime + active_time (double accounted busyness!!)
> GuC) gets ctxt switchout event and sets engine_id and start_time to ~0

Thanks for the explanation; yes, I understand the double accounting now.

>
> This is not rare when running the IGT tests, so the total runtime is
> updated in the query only if engine_id is idle. continued below ...
>
> >
> >>
> >> > 2. Pin context image before reading
> >> > 3. Merge __guc_context_update_clks and guc_context_update_stats into a
> >> >   single function
> >> > 4. Call lrc_update_runtime() unconditionally in guc_context_update_stats
> >> > 5. Seems no need to update ce->stats.active with this approach
> >> >
> >> > Some of the above steps may not be correct or complete but the idea is to
> >> > discuss/improve the original patch.
> >> >
> >> > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> >> > Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
> >> > ---
> >> > drivers/gpu/drm/i915/gt/intel_context.c       |  2 +-
> >> > drivers/gpu/drm/i915/gt/intel_context_types.h |  2 +-
> >> > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 ++++++++++---------
> >> > 3 files changed, 24 insertions(+), 21 deletions(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> >> > index e2d70a9fdac0..c004f676de27 100644
> >> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> >> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> >> > @@ -581,7 +581,7 @@ u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
> >> >	u64 total, active;
> >> >
> >> >	if (ce->ops->update_stats)
> >> > -		ce->ops->update_stats(ce);
> >> > +		return ce->ops->update_stats(ce);
> >>
> >> This is an improvement that we can add and eventually, we can make this a
> >> GuC specific vfunc. Doing so may also remove the COPS_RUNTIME_ACTIVE_TOTAL
> >> option that I added to GuC specific context ops.
> >
> > I went ahead and did this in v2 of the RFC patch.
> >
> >> >
> >> >	total = ce->stats.runtime.total;
> >> >	if (ce->ops->flags & COPS_RUNTIME_CYCLES)
> >> > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> >> > index f7ff4c7d81c7..699435bfff99 100644
> >> > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> >> > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> >> > @@ -59,7 +59,7 @@ struct intel_context_ops {
> >> >
> >> >	void (*sched_disable)(struct intel_context *ce);
> >> >
> >> > -	void (*update_stats)(struct intel_context *ce);
> >> > +	u64 (*update_stats)(struct intel_context *ce);
> >> >
> >> >	void (*reset)(struct intel_context *ce);
> >> >	void (*destroy)(struct kref *kref);
> >> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> >> > index bee8cf10f749..40d0edaf2e0a 100644
> >> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> >> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> >> > @@ -1376,7 +1376,6 @@ static void __update_guc_busyness_stats(struct intel_guc *guc)
> >> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> >> > }
> >> >
> >> > -static void __guc_context_update_clks(struct intel_context *ce);
> >> > static void guc_timestamp_ping(struct work_struct *wrk)
> >> > {
> >> >	struct intel_guc *guc = container_of(wrk, typeof(*guc),
> >> > @@ -1401,7 +1400,8 @@ static void guc_timestamp_ping(struct work_struct *wrk)
> >> >
> >> >	/* adjust context stats for overflow */
> >> >	xa_for_each(&guc->context_lookup, index, ce)
> >> > -		__guc_context_update_clks(ce);
> >> > +		if (ce->ops->update_stats)
> >> > +			ce->ops->update_stats(ce);
> >>
> >> context pinning logic needs to be separated for this (ping) path vs. the
> >> query path - intel_context_get_total_runtime_ns().
> >
> > Done in v2 of RFC patch.
> >
> >> >
> >> >	intel_gt_reset_unlock(gt, srcu);
> >> >
> >> > @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct intel_gt *gt)
> >> >			 guc->timestamp.ping_delay);
> >> > }
> >> >
> >> > -static void __guc_context_update_clks(struct intel_context *ce)
> >> > +static u64 guc_context_update_stats(struct intel_context *ce)
> >> > {
> >> >	struct intel_guc *guc = ce_to_guc(ce);
> >> >	struct intel_gt *gt = ce->engine->gt;
> >> >	u32 *pphwsp, last_switch, engine_id;
> >> > -	u64 start_gt_clk, active;
> >> >	unsigned long flags;
> >> > +	u64 total, active = 0;
> >> >	ktime_t unused;
> >> >
> >> > +	intel_context_pin(ce);
> >>
> >> intel_context_pin can sleep and we are not allowed to sleep in this path -
> >> intel_context_get_total_runtime_ns(), however we can sleep in the ping
> >> worker path, so ideally we want to separate it out for the 2 paths.
> >
> > Do we know which intel_context_get_total_runtime_ns() call is not allowed
> > to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
> > sleep. As mentioned I have done this is v2 of RFC patch which I think is
> > sufficient, but a more complicated scheme (which I think we can avoid for
> > now) would be to pin in code paths when sleeping is allowed.
> >
>
> Hmm, maybe intel_context_get_total_runtime_ns can sleep, not sure why I was
> assuming that this falls in the perf_pmu path. This is now in the
> drm_fdinfo query path. + Tvrtko.
>
> @Tvrtko, any idea if intel_context_get_total_runtime_ns path can sleep?

intel_context_get_total_runtime_ns is called from multiple places so it is
possible some of the code paths can sleep and others cannot.

> >> >	spin_lock_irqsave(&guc->timestamp.lock, flags);
> >> >
> >> > +	lrc_update_runtime(ce);
> >> > +	total = ce->stats.runtime.total;
> >>
> >> That would get called every second (default intel_gpu_top query internal)
> >> for a long running workload. multiply that with all active contexts.
> >>
> >> For long running contexts and frequenct queries, calling this ^ when a
> >> context is active does not add value. I would just call it in the else like
> >> before.
> >
> > I have done this in v2 but are we certain that when a context is active
> > it's ce->stats.runtime.total has been updated? Is it updated on each
> > context save or only when scheduling is disabled (which would not happen if
> > the context is active). That was the reason for my keeping it out of the
> > else.
>
> continued here ...
>
> You are right, it's not updated until scheduling is disabled or a query
> falls in the else block. We can run into a case where the
> stats.runtime.total has not yet been updated because the context is
> switching in/out too frequently and the query always lands in the active
> block. In this case busyness values will accumulate the time between
> switch-out to switch-in and eventually drop when ce->stats.runtime.total is
> updated. This is an issue we need to resolve.

I agree that given this state of things there is nothing we can do in the
KMD to mitigate this double accounting.

> I think we should track the previous value of total_runtime. On a query, if
> total_runtime has changed, we should just return that (irrespective of
> whether engine is active or not. Active time should be part of the busyness
> calculations only if total_runtime has not changed (compared to previous
> sample).

Yes, this will keep busyness values monotonic, though it is still error
prone. Alternatively, maybe we can drop the monotonicity requirement, to
avoid adding extra code in the KMD, with the justification that this is a
GuC FW bug (see next paragraph).

I think GuC also needs to provide the 'total' time (in addition to the
'start' time). So when GuC sets the engine id to ~0 it should also write
the 'total' time (like the GPU does), preferably as 64 bits; we seem to
have introduced so much complexity by not using the right data types. If
updates are non-atomic, GuC can also add a flag saying 'updates are in
progress'. I think that is all that is needed from the GuC to resolve this
double accounting.
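The CPU side of such a protocol could look roughly like the sketch below (a hypothetical model: the record layout, names, and odd/even convention are assumptions about a possible GuC interface, not anything GuC implements today):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared record: a 64-bit total written together with the
 * engine id, guarded by a sequence counter that GuC would hold odd for
 * the duration of a non-atomic update. */
struct guc_ctx_record {
	uint32_t seq;		/* odd while an update is in progress */
	uint32_t engine_id;
	uint64_t total;
};

/* Returns 0 on a consistent snapshot, -1 if the caller should retry. */
static int read_record(const struct guc_ctx_record *r,
		       uint32_t *engine_id, uint64_t *total)
{
	uint32_t seq = r->seq;

	if (seq & 1)
		return -1;	/* update in flight */
	*engine_id = r->engine_id;
	*total = r->total;
	if (r->seq != seq)
		return -1;	/* record changed underneath us */
	return 0;
}
```

With a consistent (engine_id, total) pair, the KMD would never pair a stale total with a fresh engine state, which is the root of the double accounting.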

Thanks.
--
Ashutosh


> >
> >> > +
> >> >	/*
> >> >	 * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when context is switched
> >> >	 * out, however GuC updates PPHWSP offsets below. Hence KMD (CPU)
> >> > @@ -1502,27 +1506,26 @@ static void __guc_context_update_clks(struct intel_context *ce)
> >> >	guc_update_pm_timestamp(guc, &unused);
> >> >
> >> >	if (engine_id != 0xffffffff && last_switch) {
> >> > -		start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
> >> > -		__extend_last_switch(guc, &start_gt_clk, last_switch);
> >> > -		active = intel_gt_clock_interval_to_ns(gt, guc->timestamp.gt_stamp - start_gt_clk);
> >> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
> >> > -		WRITE_ONCE(ce->stats.active, active);
> >> > -	} else {
> >> > -		lrc_update_runtime(ce);
> >> > +		__extend_last_switch(guc, &ce->stats.runtime.start_gt_clk, last_switch);
> >> > +		active = intel_gt_clock_interval_to_ns(gt,
> >> > +			      guc->timestamp.gt_stamp - ce->stats.runtime.start_gt_clk);
> >>
> >> Makes sense. Earlier it was a rmw, now it's just a write to
> >> ce->stats.runtime.start_gt_clk.
> >
> > It is still a rmw, just that it is not as explicitly seen in the code as
> > was the case earlier.
> >
> >> >	}
> >> >
> >> >	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> >> > +	intel_context_unpin(ce);
> >> > +
> >> > +	return total + active;
> >>
> >> return ce->stats.runtime.total + active;
> >>
> >> and then drop the local copy of total by bringing back the else{}.
> >
> > Some changes here in v2, ce->stats.active had to be revived since we need
> > to return previously computed value when we cannot pin.
>
> Let's wait for Tvrtko's inputs. If this can sleep, I'd just use the
> original changes you had.
>
> >
> >> > }
> >> >
> >> > -static void guc_context_update_stats(struct intel_context *ce)
> >> > +void lrc_update_runtime_locked(struct intel_context *ce)
> >> > {
> >> > -	if (!intel_context_pin_if_active(ce)) {
> >> > -		WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
> >> > -		WRITE_ONCE(ce->stats.active, 0);
> >> > -		return;
> >> > -	}
> >> > +	struct intel_guc *guc = ce_to_guc(ce);
> >> > +	unsigned long flags;
> >> >
> >> > -	__guc_context_update_clks(ce);
> >> > +	intel_context_pin(ce);
> >> > +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> >> > +	lrc_update_runtime(ce);
> >> > +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> >> >	intel_context_unpin(ce);
> >>
> >> From where lrc_update_runtime_locked is being called, I would assume that
> >> the context is already pinned (until unpin_guc_id is called).
> >
> > Fixed in v2.
> >
> > From the other email:
> >
> >> From your rfc patch, I like
> >> - the idea of not touching ce->stats.active
> >> - having the update_stats return u64
> >> - not doing a rmw for start_gt_clk
> >>
> >> With those changes, we are only accessing total in ce->stats, so we don't
> >> really need a lrc_update_runtime_locked.
> >
> > Strictly speaking, (as explained above) because the computation of
> > ce->stats.runtime.total in lrc_update_runtime is not atomic and there are
> > potentially multiple concurrent callers of lrc_update_runtime,
> > lrc_update_runtime_locked is needed and I have retained it in v2.
> >
> > When you say we don't need it, you probably mean that because we are
> > dealing with something like busyness, we can tolerate occasional errors in
> > busyness if we don't have locking (and we can't verify busyness values
> > anyway though I think they are expected to be monotonically
> > increasing). But my question is why not keep it simple and retain the
> > locking?
>
> The locking you suggested earlier is fine. I think I assumed
> ce->stats.runtime.total was being updated atomically, but I see your point.
>
> Thanks,
> Umesh
>
> >
> > Thanks.
> > --
> > Ashutosh
> >
> >> > }
> >> >
> >> > @@ -2780,7 +2783,7 @@ static void guc_context_unpin(struct intel_context *ce)
> >> > {
> >> >	struct intel_guc *guc = ce_to_guc(ce);
> >> >
> >> > -	lrc_update_runtime(ce);
> >> > +	lrc_update_runtime_locked(ce);
> >> >	unpin_guc_id(guc, ce);
> >> >	lrc_unpin(ce);
> >> >
> >> > --
> >> > 2.34.1
> >> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-09-06 18:29       ` Umesh Nerlige Ramappa
  2022-09-07  7:26         ` Dixit, Ashutosh
@ 2022-09-07  7:28         ` Tvrtko Ursulin
  2022-09-07 15:03           ` Dixit, Ashutosh
  1 sibling, 1 reply; 10+ messages in thread
From: Tvrtko Ursulin @ 2022-09-07  7:28 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa, Dixit, Ashutosh; +Cc: intel-gfx



On 06/09/2022 19:29, Umesh Nerlige Ramappa wrote:
> On Thu, Sep 01, 2022 at 04:55:22PM -0700, Dixit, Ashutosh wrote:
>> On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:

[snip]

>>> >
>>> >    intel_gt_reset_unlock(gt, srcu);
>>> >
>>> > @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct 
>>> intel_gt *gt)
>>> >             guc->timestamp.ping_delay);
>>> > }
>>> >
>>> > -static void __guc_context_update_clks(struct intel_context *ce)
>>> > +static u64 guc_context_update_stats(struct intel_context *ce)
>>> > {
>>> >    struct intel_guc *guc = ce_to_guc(ce);
>>> >    struct intel_gt *gt = ce->engine->gt;
>>> >    u32 *pphwsp, last_switch, engine_id;
>>> > -    u64 start_gt_clk, active;
>>> >    unsigned long flags;
>>> > +    u64 total, active = 0;
>>> >    ktime_t unused;
>>> >
>>> > +    intel_context_pin(ce);
>>>
>>> intel_context_pin can sleep and we are not allowed to sleep in this
>>> path - intel_context_get_total_runtime_ns(). However, we can sleep in
>>> the ping worker path, so ideally we want to separate it out for the
>>> two paths.
>>
>> Do we know which intel_context_get_total_runtime_ns() call is not allowed
>> to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
>> sleep. As mentioned, I have done this in v2 of the RFC patch which I think is
>> sufficient, but a more complicated scheme (which I think we can avoid for
>> now) would be to pin in code paths when sleeping is allowed.
>>
> 
> Hmm, maybe intel_context_get_total_runtime_ns can sleep, not sure why I 
> was assuming that this falls in the perf_pmu path. This is now in the 
> drm_fdinfo query path. + Tvrtko.
> 
> @Tvrtko, any idea if intel_context_get_total_runtime_ns path can sleep?

Not at the moment - it calls it from a lockless (rcu) section when 
walking all the contexts belonging to a client. Idea being performance 
queries should have minimum effect on a running system. I think it would 
be possible to change in theory but not sure how much work. Is there a 
hard need to sleep in there, or what?

Regards,

Tvrtko


* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-09-07  7:28         ` Tvrtko Ursulin
@ 2022-09-07 15:03           ` Dixit, Ashutosh
  2022-09-08  9:33             ` Tvrtko Ursulin
  0 siblings, 1 reply; 10+ messages in thread
From: Dixit, Ashutosh @ 2022-09-07 15:03 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Wed, 07 Sep 2022 00:28:48 -0700, Tvrtko Ursulin wrote:
>
> On 06/09/2022 19:29, Umesh Nerlige Ramappa wrote:
> > On Thu, Sep 01, 2022 at 04:55:22PM -0700, Dixit, Ashutosh wrote:
> >> On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:
>
> [snip]
>
> >>> >
> >>> >    intel_gt_reset_unlock(gt, srcu);
> >>> >
> >>> > @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct
> >>> intel_gt *gt)
> >>> >             guc->timestamp.ping_delay);
> >>> > }
> >>> >
> >>> > -static void __guc_context_update_clks(struct intel_context *ce)
> >>> > +static u64 guc_context_update_stats(struct intel_context *ce)
> >>> > {
> >>> >    struct intel_guc *guc = ce_to_guc(ce);
> >>> >    struct intel_gt *gt = ce->engine->gt;
> >>> >    u32 *pphwsp, last_switch, engine_id;
> >>> > -    u64 start_gt_clk, active;
> >>> >    unsigned long flags;
> >>> > +    u64 total, active = 0;
> >>> >    ktime_t unused;
> >>> >
> >>> > +    intel_context_pin(ce);
> >>>
> >>> intel_context_pin can sleep and we are not allowed to sleep in this
> >>> path - intel_context_get_total_runtime_ns(). However, we can sleep in
> >>> the ping worker path, so ideally we want to separate it out for the
> >>> two paths.
> >>
> >> Do we know which intel_context_get_total_runtime_ns() call is not allowed
> >> to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
> >> sleep. As mentioned, I have done this in v2 of the RFC patch which I think is
> >> sufficient, but a more complicated scheme (which I think we can avoid for
> >> now) would be to pin in code paths when sleeping is allowed.
> >>
> >
> > Hmm, maybe intel_context_get_total_runtime_ns can sleep, not sure why I
> > was assuming that this falls in the perf_pmu path. This is now in the
> > drm_fdinfo query path. + Tvrtko.
> >
> > @Tvrtko, any idea if intel_context_get_total_runtime_ns path can sleep?
>
> Not at the moment - it calls it from a lockless (rcu) section when walking
> all the contexts belonging to a client. Idea being performance queries
> should have minimum effect on a running system.

Hmm, my bad - I missed the RCU and assumed a userspace thread would be able
to sleep.

> I think it would be possible to change in theory but not sure how much
> work. Is there a hard need to sleep in there, or what?

GuC contexts need to be pinned/unpinned, which can sleep, but I am
investigating whether we can return a previously computed busyness value
when we cannot pin/sleep.

Thanks.
--
Ashutosh


* Re: [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking
  2022-09-07 15:03           ` Dixit, Ashutosh
@ 2022-09-08  9:33             ` Tvrtko Ursulin
  0 siblings, 0 replies; 10+ messages in thread
From: Tvrtko Ursulin @ 2022-09-08  9:33 UTC (permalink / raw)
  To: Dixit, Ashutosh; +Cc: intel-gfx


On 07/09/2022 16:03, Dixit, Ashutosh wrote:
> On Wed, 07 Sep 2022 00:28:48 -0700, Tvrtko Ursulin wrote:
>>
>> On 06/09/2022 19:29, Umesh Nerlige Ramappa wrote:
>>> On Thu, Sep 01, 2022 at 04:55:22PM -0700, Dixit, Ashutosh wrote:
>>>> On Wed, 31 Aug 2022 15:45:49 -0700, Umesh Nerlige Ramappa wrote:
>>
>> [snip]
>>
>>>>>>
>>>>>>      intel_gt_reset_unlock(gt, srcu);
>>>>>>
>>>>>> @@ -1476,17 +1476,21 @@ void intel_guc_busyness_unpark(struct
>>>>> intel_gt *gt)
>>>>>>               guc->timestamp.ping_delay);
>>>>>> }
>>>>>>
>>>>>> -static void __guc_context_update_clks(struct intel_context *ce)
>>>>>> +static u64 guc_context_update_stats(struct intel_context *ce)
>>>>>> {
>>>>>>      struct intel_guc *guc = ce_to_guc(ce);
>>>>>>      struct intel_gt *gt = ce->engine->gt;
>>>>>>      u32 *pphwsp, last_switch, engine_id;
>>>>>> -    u64 start_gt_clk, active;
>>>>>>      unsigned long flags;
>>>>>> +    u64 total, active = 0;
>>>>>>      ktime_t unused;
>>>>>>
>>>>>> +    intel_context_pin(ce);
>>>>>
>>>>> intel_context_pin can sleep and we are not allowed to sleep in this
>>>>> path - intel_context_get_total_runtime_ns(). However, we can sleep in
>>>>> the ping worker path, so ideally we want to separate it out for the
>>>>> two paths.
>>>>
>>>> Do we know which intel_context_get_total_runtime_ns() call is not allowed
>>>> to sleep? E.g. the code path from i915_drm_client_fdinfo() is allowed to
>>>> sleep. As mentioned, I have done this in v2 of the RFC patch which I think is
>>>> sufficient, but a more complicated scheme (which I think we can avoid for
>>>> now) would be to pin in code paths when sleeping is allowed.
>>>>
>>>
>>> Hmm, maybe intel_context_get_total_runtime_ns can sleep, not sure why I
>>> was assuming that this falls in the perf_pmu path. This is now in the
>>> drm_fdinfo query path. + Tvrtko.
>>>
>>> @Tvrtko, any idea if intel_context_get_total_runtime_ns path can sleep?
>>
>> Not at the moment - it calls it from a lockless (rcu) section when walking
>> all the contexts belonging to a client. Idea being performance queries
>> should have minimum effect on a running system.
> 
> Hmm my bad, missing the rcu and assuming a userspace thread will be able to
> sleep.
> 
>> I think it would be possible to change in theory but not sure how much
>> work. Is there a hard need to sleep in there, or what?
> 
> GuC contexts need to be pinned/unpinned, which can sleep, but I am
> investigating whether we can return a previously computed busyness value
> when we cannot pin/sleep.

Yeah, it would be conceptually nice to keep that query lightweight.
Doing too much work to query accumulated state, like in the case of
pinning the unpinned contexts, feels a bit over the top. Hopefully there
aren't any nasty races which would make this hard.

Regards,

Tvrtko


Thread overview: 10+ messages
2022-08-31 19:33 [Intel-gfx] [PATCH 1/2] i915/pmu: Wire GuC backend to per-client busyness Ashutosh Dixit
2022-08-31 19:33 ` [Intel-gfx] [RFC PATCH 2/2] Fix per client busyness locking Ashutosh Dixit
2022-08-31 22:45   ` Umesh Nerlige Ramappa
2022-09-01 23:55     ` Dixit, Ashutosh
2022-09-06 18:29       ` Umesh Nerlige Ramappa
2022-09-07  7:26         ` Dixit, Ashutosh
2022-09-07  7:28         ` Tvrtko Ursulin
2022-09-07 15:03           ` Dixit, Ashutosh
2022-09-08  9:33             ` Tvrtko Ursulin
2022-08-31 19:50 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/2] i915/pmu: Wire GuC backend to per-client busyness Patchwork
