* [PATCH v2 0/3] Improve anti-pre-emption w/a for compute workloads
@ 2022-02-25 20:41 ` John.C.Harrison
  0 siblings, 0 replies; 15+ messages in thread
From: John.C.Harrison @ 2022-02-25 20:41 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, hang detection was left to the heartbeat and
its (longer) timeout. This is undesirable with GuC submission as the
heartbeat recovery path is a full GT reset rather than a per engine
reset and so is much more destructive. Instead, just bump the
pre-emption timeout to a large value. Also, update the heartbeat to
allow such a long pre-emption delay in the final heartbeat period.

v2: Add clamping helpers.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (3):
  drm/i915/guc: Limit scheduling properties to avoid overflow
  drm/i915/gt: Make the heartbeat play nice with long pre-emption
    timeouts
  drm/i915: Improve long running OCL w/a for GuC submission

 drivers/gpu/drm/i915/gt/intel_engine.h        |  6 ++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 91 ++++++++++++++++++-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 ++++
 drivers/gpu/drm/i915/gt/sysfs_engines.c       | 25 +++--
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 ++
 5 files changed, 134 insertions(+), 13 deletions(-)

-- 
2.25.1


* [PATCH v2 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
@ 2022-02-25 20:41   ` John.C.Harrison
  -1 siblings, 0 replies; 15+ messages in thread
From: John.C.Harrison @ 2022-02-25 20:41 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Daniele Ceraolo Spurio, John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

GuC converts the pre-emption timeout and timeslice quantum values into
clock ticks internally. That significantly lowers the value at which a
32bit overflow occurs; on current platforms, the worst case limit is
approximately 110 seconds. Rather than allowing the user to set higher
values and then be confused by early timeouts, add limits when setting
these values.
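
As a rough sketch of where that limit comes from (the 38.4 MHz
reference clock below is an assumed example figure, not something
specified in this patch), the overflow point of a 32bit tick counter
can be estimated with:

  /*
   * Minimal userspace sketch (not kernel code): estimate the largest
   * millisecond value representable in a 32bit tick counter, assuming
   * a hypothetical 38.4 MHz GuC reference clock.
   */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          const uint64_t clock_hz = 38400000ULL; /* assumed example rate */
          uint64_t max_ms = ((uint64_t)UINT32_MAX * 1000) / clock_hz;

          printf("32bit tick overflow after ~%llu ms (~%llu s)\n",
                 (unsigned long long)max_ms,
                 (unsigned long long)(max_ms / 1000));
          return 0;
  }

Under that assumption the counter wraps after roughly 111 seconds,
which is why the patch caps both properties at 100 seconds via the
GUC_POLICY_MAX_* defines and rejects larger sysfs writes with -EINVAL.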

v2: Add helper functions for clamping (review feedback from Tvrtko).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> (v1)
---
 drivers/gpu/drm/i915/gt/intel_engine.h      |  6 ++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 69 +++++++++++++++++++++
 drivers/gpu/drm/i915/gt/sysfs_engines.c     | 25 +++++---
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++
 4 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index be4b1e65442f..5a9186f784c4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -349,4 +349,10 @@ intel_engine_get_hung_context(struct intel_engine_cs *engine)
 	return engine->hung_ce;
 }
 
+u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs *engine, u64 value);
+u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs *engine, u64 value);
+u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, u64 value);
+u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 value);
+u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value);
+
 #endif /* _INTEL_RINGBUFFER_H_ */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index e855c801ba28..7ad9e6006656 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -399,6 +399,26 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
 		engine->props.preempt_timeout_ms = 0;
 
+	/* Cap properties according to any system limits */
+#define CLAMP_PROP(field) \
+	do { \
+		u64 clamp = intel_clamp_##field(engine, engine->props.field); \
+		if (clamp != engine->props.field) { \
+			drm_notice(&engine->i915->drm, \
+				   "Warning, clamping %s to %lld to prevent overflow\n", \
+				   #field, clamp); \
+			engine->props.field = clamp; \
+		} \
+	} while (0)
+
+	CLAMP_PROP(heartbeat_interval_ms);
+	CLAMP_PROP(max_busywait_duration_ns);
+	CLAMP_PROP(preempt_timeout_ms);
+	CLAMP_PROP(stop_timeout_ms);
+	CLAMP_PROP(timeslice_duration_ms);
+
+#undef CLAMP_PROP
+
 	engine->defaults = engine->props; /* never to change again */
 
 	engine->context_size = intel_engine_context_size(gt, engine->class);
@@ -421,6 +441,55 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	return 0;
 }
 
+u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs *engine, u64 value)
+{
+	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
+
+	return value;
+}
+
+u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs *engine, u64 value)
+{
+	value = min(value, jiffies_to_nsecs(2));
+
+	return value;
+}
+
+u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, u64 value)
+{
+	/*
+	 * NB: The GuC API only supports 32bit values. However, the limit is further
+	 * reduced due to internal calculations which would otherwise overflow.
+	 */
+	if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
+		value = min_t(u64, value, GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
+
+	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
+
+	return value;
+}
+
+u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 value)
+{
+	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
+
+	return value;
+}
+
+u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value)
+{
+	/*
+	 * NB: The GuC API only supports 32bit values. However, the limit is further
+	 * reduced due to internal calculations which would otherwise overflow.
+	 */
+	if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
+		value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);
+
+	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
+
+	return value;
+}
+
 static void __setup_engine_capabilities(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *i915 = engine->i915;
diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c b/drivers/gpu/drm/i915/gt/sysfs_engines.c
index 967031056202..f2d9858d827c 100644
--- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
+++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
@@ -144,7 +144,7 @@ max_spin_store(struct kobject *kobj, struct kobj_attribute *attr,
 	       const char *buf, size_t count)
 {
 	struct intel_engine_cs *engine = kobj_to_engine(kobj);
-	unsigned long long duration;
+	unsigned long long duration, clamped;
 	int err;
 
 	/*
@@ -168,7 +168,8 @@ max_spin_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (err)
 		return err;
 
-	if (duration > jiffies_to_nsecs(2))
+	clamped = intel_clamp_max_busywait_duration_ns(engine, duration);
+	if (duration != clamped)
 		return -EINVAL;
 
 	WRITE_ONCE(engine->props.max_busywait_duration_ns, duration);
@@ -203,7 +204,7 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
 		const char *buf, size_t count)
 {
 	struct intel_engine_cs *engine = kobj_to_engine(kobj);
-	unsigned long long duration;
+	unsigned long long duration, clamped;
 	int err;
 
 	/*
@@ -218,7 +219,8 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (err)
 		return err;
 
-	if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+	clamped = intel_clamp_timeslice_duration_ms(engine, duration);
+	if (duration != clamped)
 		return -EINVAL;
 
 	WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
@@ -256,7 +258,7 @@ stop_store(struct kobject *kobj, struct kobj_attribute *attr,
 	   const char *buf, size_t count)
 {
 	struct intel_engine_cs *engine = kobj_to_engine(kobj);
-	unsigned long long duration;
+	unsigned long long duration, clamped;
 	int err;
 
 	/*
@@ -272,7 +274,8 @@ stop_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (err)
 		return err;
 
-	if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+	clamped = intel_clamp_stop_timeout_ms(engine, duration);
+	if (duration != clamped)
 		return -EINVAL;
 
 	WRITE_ONCE(engine->props.stop_timeout_ms, duration);
@@ -306,7 +309,7 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
 		      const char *buf, size_t count)
 {
 	struct intel_engine_cs *engine = kobj_to_engine(kobj);
-	unsigned long long timeout;
+	unsigned long long timeout, clamped;
 	int err;
 
 	/*
@@ -322,7 +325,8 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (err)
 		return err;
 
-	if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+	clamped = intel_clamp_preempt_timeout_ms(engine, timeout);
+	if (timeout != clamped)
 		return -EINVAL;
 
 	WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
@@ -362,7 +366,7 @@ heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
 		const char *buf, size_t count)
 {
 	struct intel_engine_cs *engine = kobj_to_engine(kobj);
-	unsigned long long delay;
+	unsigned long long delay, clamped;
 	int err;
 
 	/*
@@ -379,7 +383,8 @@ heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (err)
 		return err;
 
-	if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+	clamped = intel_clamp_heartbeat_interval_ms(engine, delay);
+	if (delay != clamped)
 		return -EINVAL;
 
 	err = intel_engine_set_heartbeat(engine, delay);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 6a4612a852e2..ad131092f8df 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -248,6 +248,15 @@ struct guc_lrc_desc {
 
 #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
 
+/*
+ * GuC converts the timeout to clock ticks internally. Different platforms have
+ * different GuC clocks. Thus, the maximum value before overflow is platform
+ * dependent. Current worst case scenario is about 110s. So, limit to 100s to be
+ * safe.
+ */
+#define GUC_POLICY_MAX_EXEC_QUANTUM_MS		(100 * 1000)
+#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS	(100 * 1000)
+
 struct guc_policies {
 	u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
 	/* In micro seconds. How much time to allow before DPC processing is
-- 
2.25.1


* [PATCH v2 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
@ 2022-02-25 20:41   ` John.C.Harrison
  -1 siblings, 0 replies; 15+ messages in thread
From: John.C.Harrison @ 2022-02-25 20:41 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Compute workloads are inherently not pre-emptible for long periods on
current hardware. As a workaround for this, the pre-emption timeout
for compute capable engines was disabled. This is undesirable with GuC
submission as it prevents per engine reset of hung contexts. Hence the
next patch will re-enable the timeout but bumped up by an order of
magnitude.

However, the heartbeat might not respect that. Depending upon current
activity, a pre-emption to the heartbeat pulse might not even be
attempted until the last heartbeat period, which means only one period
is granted for the pre-emption to occur. With the aforementioned bump,
the pre-emption timeout could be significantly larger than this
heartbeat period.

So adjust the heartbeat code to take the pre-emption timeout into
account. When it reaches the final (high priority) period, it now
ensures the delay before hitting reset is longer than the pre-emption
timeout.
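
For illustration, taking the common defaults as an assumption (a
2500 ms heartbeat interval and, after the next patch, a 7500 ms
pre-emption timeout; neither figure is taken from this patch), the
final high priority period stretches from 2500 ms to
max(2500, 2 * 7500) = 15000 ms, so the pulse gets a full pre-emption
timeout plus margin in which to land before the engine is declared
hung.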

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index a3698f611f45..72a82a6085e0 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -22,9 +22,25 @@
 
 static bool next_heartbeat(struct intel_engine_cs *engine)
 {
+	struct i915_request *rq;
 	long delay;
 
 	delay = READ_ONCE(engine->props.heartbeat_interval_ms);
+
+	rq = engine->heartbeat.systole;
+	if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER) {
+		long longer;
+
+		/*
+		 * The final try is at the highest priority possible. Up until now
+		 * a pre-emption might not even have been attempted. So make sure
+		 * this last attempt allows enough time for a pre-emption to occur.
+		 */
+		longer = READ_ONCE(engine->props.preempt_timeout_ms) * 2;
+		if (longer > delay)
+			delay = longer;
+	}
+
 	if (!delay)
 		return false;
 
-- 
2.25.1


* [PATCH v2 3/3] drm/i915: Improve long running OCL w/a for GuC submission
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
@ 2022-02-25 20:41   ` John.C.Harrison
  -1 siblings, 0 replies; 15+ messages in thread
From: John.C.Harrison @ 2022-02-25 20:41 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Michal Mrozek, Daniele Ceraolo Spurio, John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

A workaround was added to the driver to allow OpenCL workloads to run
'forever' by disabling pre-emption on the RCS engine for Gen12. This
is not totally unbounded as the heartbeat will still kick in eventually
and cause a reset of the hung engine.

However, this does not work well in GuC submission mode. In GuC mode,
the pre-emption timeout is how GuC detects hung contexts and triggers
a per engine reset. Thus, disabling the timeout means also losing all
per engine reset ability. A full GT reset will still occur when the
heartbeat finally expires, but that is a much more destructive and
undesirable mechanism.

The purpose of the workaround is actually to give OpenCL tasks longer
to reach a pre-emption point after a pre-emption request has been
issued. This is necessary because Gen12 does not support mid-thread
pre-emption and OpenCL can have long running threads.

So, rather than disabling the timeout completely, just set it to a
'long' value.
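
As a rough worked example (assuming the Kconfig defaults of a 2500 ms
heartbeat interval and a 640 ms pre-emption timeout, which are not
stated in this patch), three heartbeat periods gives 3 * 2500 =
7500 ms, so the render engine's pre-emption timeout is bumped from
640 ms to 7500 ms rather than being disabled entirely, while a
genuinely hung context is still reset within a few heartbeat periods.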

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 7ad9e6006656..84db5bf36285 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -395,9 +395,25 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	engine->props.timeslice_duration_ms =
 		CONFIG_DRM_I915_TIMESLICE_DURATION;
 
-	/* Override to uninterruptible for OpenCL workloads. */
-	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
-		engine->props.preempt_timeout_ms = 0;
+	/*
+	 * Mid-thread pre-emption is not available in Gen12. Unfortunately,
+	 * some OpenCL workloads run quite long threads. That means they get
+	 * reset due to not pre-empting in a timely manner. So, bump the
+	 * pre-emption timeout value to be much higher for compute engines.
+	 * Using three times the heartbeat period seems long enough for a
+	 * reasonable task to reach a pre-emption point but not so long as to
+	 * allow genuine hangs to go unresolved.
+	 */
+	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS) {
+		unsigned long triple_beat = engine->props.heartbeat_interval_ms * 3;
+
+		if (triple_beat > engine->props.preempt_timeout_ms) {
+			drm_info(&gt->i915->drm, "Bumping pre-emption timeout from %ld to %ld on %s to allow slow compute pre-emption\n",
+				 engine->props.preempt_timeout_ms, triple_beat, engine->name);
+
+			engine->props.preempt_timeout_ms = triple_beat;
+		}
+	}
 
 	/* Cap properties according to any system limits */
 #define CLAMP_PROP(field) \
-- 
2.25.1


* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Improve anti-pre-emption w/a for compute workloads (rev2)
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
                   ` (3 preceding siblings ...)
  (?)
@ 2022-02-26  1:04 ` Patchwork
  -1 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2022-02-26  1:04 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Improve anti-pre-emption w/a for compute workloads (rev2)
URL   : https://patchwork.freedesktop.org/series/100428/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
57800ec7188b drm/i915/guc: Limit scheduling properties to avoid overflow
-:42: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'field' - possible side-effects?
#42: FILE: drivers/gpu/drm/i915/gt/intel_engine_cs.c:403:
+#define CLAMP_PROP(field) \
+	do { \
+		u64 clamp = intel_clamp_##field(engine, engine->props.field); \
+		if (clamp != engine->props.field) { \
+			drm_notice(&engine->i915->drm, \
+				   "Warning, clamping %s to %lld to prevent overflow\n", \
+				   #field, clamp); \
+			engine->props.field = clamp; \
+		} \
+	} while (0)

total: 0 errors, 0 warnings, 1 checks, 191 lines checked
5faecf54eba0 drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
ada91b27b412 drm/i915: Improve long running OCL w/a for GuC submission




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Improve anti-pre-emption w/a for compute workloads (rev2)
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
                   ` (4 preceding siblings ...)
  (?)
@ 2022-02-26  1:05 ` Patchwork
  -1 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2022-02-26  1:05 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Improve anti-pre-emption w/a for compute workloads (rev2)
URL   : https://patchwork.freedesktop.org/series/100428/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.




* [Intel-gfx] ✗ Fi.CI.BAT: failure for Improve anti-pre-emption w/a for compute workloads (rev2)
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
                   ` (5 preceding siblings ...)
  (?)
@ 2022-02-26  1:45 ` Patchwork
  -1 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2022-02-26  1:45 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx


== Series Details ==

Series: Improve anti-pre-emption w/a for compute workloads (rev2)
URL   : https://patchwork.freedesktop.org/series/100428/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_11290 -> Patchwork_22419
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_22419 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_22419, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/index.html

Participating hosts (41 -> 39)
------------------------------

  Additional (1): fi-kbl-guc 
  Missing    (3): fi-bsw-cyan fi-kbl-8809g fi-pnv-d510 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_22419:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@execlists:
    - fi-cfl-8109u:       [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-cfl-8109u/igt@i915_selftest@live@execlists.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-8109u/igt@i915_selftest@live@execlists.html
    - fi-glk-dsi:         [PASS][3] -> [INCOMPLETE][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-glk-dsi/igt@i915_selftest@live@execlists.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-glk-dsi/igt@i915_selftest@live@execlists.html
    - fi-cfl-8700k:       [PASS][5] -> [INCOMPLETE][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-cfl-8700k/igt@i915_selftest@live@execlists.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-8700k/igt@i915_selftest@live@execlists.html
    - fi-glk-j4005:       [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-glk-j4005/igt@i915_selftest@live@execlists.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-glk-j4005/igt@i915_selftest@live@execlists.html
    - fi-skl-guc:         [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-skl-guc/igt@i915_selftest@live@execlists.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-skl-guc/igt@i915_selftest@live@execlists.html
    - fi-skl-6700k2:      [PASS][11] -> [INCOMPLETE][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-skl-6700k2/igt@i915_selftest@live@execlists.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-skl-6700k2/igt@i915_selftest@live@execlists.html
    - fi-cfl-guc:         [PASS][13] -> [INCOMPLETE][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-cfl-guc/igt@i915_selftest@live@execlists.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-guc/igt@i915_selftest@live@execlists.html
    - fi-bxt-dsi:         [PASS][15] -> [INCOMPLETE][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-bxt-dsi/igt@i915_selftest@live@execlists.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bxt-dsi/igt@i915_selftest@live@execlists.html
    - fi-tgl-1115g4:      [PASS][17] -> [INCOMPLETE][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-tgl-1115g4/igt@i915_selftest@live@execlists.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-tgl-1115g4/igt@i915_selftest@live@execlists.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@i915_selftest@live@execlists:
    - {fi-rkl-11600}:     NOTRUN -> [INCOMPLETE][19]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-rkl-11600/igt@i915_selftest@live@execlists.html
    - {fi-ehl-2}:         [PASS][20] -> [INCOMPLETE][21]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-ehl-2/igt@i915_selftest@live@execlists.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-ehl-2/igt@i915_selftest@live@execlists.html
    - {fi-jsl-1}:         [PASS][22] -> [INCOMPLETE][23]
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-jsl-1/igt@i915_selftest@live@execlists.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-jsl-1/igt@i915_selftest@live@execlists.html
    - {bat-jsl-1}:        [PASS][24] -> [INCOMPLETE][25]
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/bat-jsl-1/igt@i915_selftest@live@execlists.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/bat-jsl-1/igt@i915_selftest@live@execlists.html
    - {bat-jsl-2}:        [PASS][26] -> [INCOMPLETE][27]
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/bat-jsl-2/igt@i915_selftest@live@execlists.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/bat-jsl-2/igt@i915_selftest@live@execlists.html
    - {fi-tgl-dsi}:       NOTRUN -> [INCOMPLETE][28]
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-tgl-dsi/igt@i915_selftest@live@execlists.html

  
Known issues
------------

  Here are the changes found in Patchwork_22419 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_cs_nop@sync-fork-compute0:
    - fi-snb-2600:        NOTRUN -> [SKIP][29] ([fdo#109271]) +17 similar issues
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-snb-2600/igt@amdgpu/amd_cs_nop@sync-fork-compute0.html

  * igt@gem_lmem_swapping@basic:
    - fi-kbl-guc:         NOTRUN -> [SKIP][30] ([fdo#109271] / [i915#4613]) +3 similar issues
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-guc/igt@gem_lmem_swapping@basic.html

  * igt@i915_selftest@live@execlists:
    - fi-bsw-nick:        [PASS][31] -> [INCOMPLETE][32] ([i915#2940])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-bsw-nick/igt@i915_selftest@live@execlists.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-nick/igt@i915_selftest@live@execlists.html
    - fi-kbl-x1275:       [PASS][33] -> [INCOMPLETE][34] ([i915#794])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-kbl-x1275/igt@i915_selftest@live@execlists.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-x1275/igt@i915_selftest@live@execlists.html
    - fi-bsw-kefka:       [PASS][35] -> [INCOMPLETE][36] ([i915#2940])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-bsw-kefka/igt@i915_selftest@live@execlists.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-kefka/igt@i915_selftest@live@execlists.html
    - fi-bsw-n3050:       [PASS][37] -> [INCOMPLETE][38] ([i915#2940])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-bsw-n3050/igt@i915_selftest@live@execlists.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-n3050/igt@i915_selftest@live@execlists.html
    - fi-kbl-7567u:       [PASS][39] -> [INCOMPLETE][40] ([i915#794])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-kbl-7567u/igt@i915_selftest@live@execlists.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-7567u/igt@i915_selftest@live@execlists.html
    - fi-kbl-soraka:      [PASS][41] -> [INCOMPLETE][42] ([i915#794])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-kbl-soraka/igt@i915_selftest@live@execlists.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-soraka/igt@i915_selftest@live@execlists.html
    - fi-kbl-7500u:       [PASS][43] -> [INCOMPLETE][44] ([i915#794])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-kbl-7500u/igt@i915_selftest@live@execlists.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-7500u/igt@i915_selftest@live@execlists.html

  * igt@kms_busy@basic:
    - fi-kbl-guc:         NOTRUN -> [SKIP][45] ([fdo#109271] / [i915#1845])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-guc/igt@kms_busy@basic.html

  * igt@kms_chamelium@vga-hpd-fast:
    - fi-kbl-guc:         NOTRUN -> [SKIP][46] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-guc/igt@kms_chamelium@vga-hpd-fast.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-kbl-guc:         NOTRUN -> [SKIP][47] ([fdo#109271] / [i915#533])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-guc/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-c:
    - fi-kbl-guc:         NOTRUN -> [SKIP][48] ([fdo#109271]) +40 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-guc/igt@kms_pipe_crc_basic@read-crc-pipe-c.html

  * igt@runner@aborted:
    - fi-kbl-x1275:       NOTRUN -> [FAIL][49] ([i915#1436] / [i915#4312])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-x1275/igt@runner@aborted.html
    - fi-bsw-kefka:       NOTRUN -> [FAIL][50] ([fdo#109271] / [i915#1436] / [i915#3428] / [i915#4312])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-kefka/igt@runner@aborted.html
    - fi-cfl-8700k:       NOTRUN -> [FAIL][51] ([i915#4312])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-8700k/igt@runner@aborted.html
    - fi-cfl-8109u:       NOTRUN -> [FAIL][52] ([i915#4312])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-8109u/igt@runner@aborted.html
    - fi-glk-dsi:         NOTRUN -> [FAIL][53] ([i915#4312] / [k.org#202321])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-glk-dsi/igt@runner@aborted.html
    - fi-bsw-nick:        NOTRUN -> [FAIL][54] ([fdo#109271] / [i915#1436] / [i915#3428] / [i915#4312])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-nick/igt@runner@aborted.html
    - fi-kbl-soraka:      NOTRUN -> [FAIL][55] ([i915#1436] / [i915#4312])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-soraka/igt@runner@aborted.html
    - fi-kbl-7500u:       NOTRUN -> [FAIL][56] ([i915#1436] / [i915#4312])
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-7500u/igt@runner@aborted.html
    - fi-bxt-dsi:         NOTRUN -> [FAIL][57] ([i915#4312])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bxt-dsi/igt@runner@aborted.html
    - fi-tgl-1115g4:      NOTRUN -> [FAIL][58] ([i915#1436] / [i915#4312])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-tgl-1115g4/igt@runner@aborted.html
    - fi-cfl-guc:         NOTRUN -> [FAIL][59] ([i915#4312])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-cfl-guc/igt@runner@aborted.html
    - fi-glk-j4005:       NOTRUN -> [FAIL][60] ([i915#4312] / [k.org#202321])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-glk-j4005/igt@runner@aborted.html
    - fi-kbl-7567u:       NOTRUN -> [FAIL][61] ([i915#1436] / [i915#4312])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-kbl-7567u/igt@runner@aborted.html
    - fi-skl-guc:         NOTRUN -> [FAIL][62] ([i915#1436] / [i915#4312])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-skl-guc/igt@runner@aborted.html
    - fi-skl-6700k2:      NOTRUN -> [FAIL][63] ([i915#1436] / [i915#4312])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-skl-6700k2/igt@runner@aborted.html
    - fi-bsw-n3050:       NOTRUN -> [FAIL][64] ([fdo#109271] / [i915#1436] / [i915#3428] / [i915#4312])
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-bsw-n3050/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s3@smem:
    - {fi-rkl-11600}:     [INCOMPLETE][65] ([i915#5127]) -> [PASS][66]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-rkl-11600/igt@gem_exec_suspend@basic-s3@smem.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-rkl-11600/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@i915_selftest@live@hangcheck:
    - bat-dg1-6:          [DMESG-FAIL][67] ([i915#4494] / [i915#4957]) -> [PASS][68]
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/bat-dg1-6/igt@i915_selftest@live@hangcheck.html
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/bat-dg1-6/igt@i915_selftest@live@hangcheck.html
    - fi-snb-2600:        [INCOMPLETE][69] ([i915#3921]) -> [PASS][70]
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11290/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/fi-snb-2600/igt@i915_selftest@live@hangcheck.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1436]: https://gitlab.freedesktop.org/drm/intel/issues/1436
  [i915#1759]: https://gitlab.freedesktop.org/drm/intel/issues/1759
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2940]: https://gitlab.freedesktop.org/drm/intel/issues/2940
  [i915#3012]: https://gitlab.freedesktop.org/drm/intel/issues/3012
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3428]: https://gitlab.freedesktop.org/drm/intel/issues/3428
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4494]: https://gitlab.freedesktop.org/drm/intel/issues/4494
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4957]: https://gitlab.freedesktop.org/drm/intel/issues/4957
  [i915#5127]: https://gitlab.freedesktop.org/drm/intel/issues/5127
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#794]: https://gitlab.freedesktop.org/drm/intel/issues/794
  [k.org#202321]: https://bugzilla.kernel.org/show_bug.cgi?id=202321


Build changes
-------------

  * Linux: CI_DRM_11290 -> Patchwork_22419

  CI-20190529: 20190529
  CI_DRM_11290: e4658cb77436a0a406de83fef483b52f84e17208 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6359: 57049558c452272b27eeb099fac07e55a924bbf9 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_22419: ada91b27b412221c517d4f2af307b5d88bb26fa9 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

ada91b27b412 drm/i915: Improve long running OCL w/a for GuC submission
5faecf54eba0 drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts
57800ec7188b drm/i915/guc: Limit scheduling properties to avoid overflow

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22419/index.html


* Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
  2022-02-25 20:41   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2022-03-02  9:49   ` Tvrtko Ursulin
  2022-03-02 18:22     ` John Harrison
  -1 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2022-03-02  9:49 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 25/02/2022 20:41, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> GuC converts the pre-emption timeout and timeslice quantum values into
> clock ticks internally. That significantly reduces the point of 32bit
> overflow. On current platforms, worst case scenario is approximately
> 110 seconds. Rather than allowing the user to set higher values and
> then get confused by early timeouts, add limits when setting these
> values.
> 
> v2: Add helper functions for clamping (review feedback from Tvrtko).
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> (v1)
> ---
>   drivers/gpu/drm/i915/gt/intel_engine.h      |  6 ++
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 69 +++++++++++++++++++++
>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 25 +++++---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++
>   4 files changed, 99 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index be4b1e65442f..5a9186f784c4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -349,4 +349,10 @@ intel_engine_get_hung_context(struct intel_engine_cs *engine)
>   	return engine->hung_ce;
>   }
>   
> +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs *engine, u64 value);
> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs *engine, u64 value);
> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, u64 value);
> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 value);
> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value);
> +
>   #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index e855c801ba28..7ad9e6006656 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -399,6 +399,26 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>   	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>   		engine->props.preempt_timeout_ms = 0;
>   
> +	/* Cap properties according to any system limits */
> +#define CLAMP_PROP(field) \
> +	do { \
> +		u64 clamp = intel_clamp_##field(engine, engine->props.field); \
> +		if (clamp != engine->props.field) { \
> +			drm_notice(&engine->i915->drm, \
> +				   "Warning, clamping %s to %lld to prevent overflow\n", \
> +				   #field, clamp); \
> +			engine->props.field = clamp; \
> +		} \
> +	} while (0)
> +
> +	CLAMP_PROP(heartbeat_interval_ms);
> +	CLAMP_PROP(max_busywait_duration_ns);
> +	CLAMP_PROP(preempt_timeout_ms);
> +	CLAMP_PROP(stop_timeout_ms);
> +	CLAMP_PROP(timeslice_duration_ms);
> +
> +#undef CLAMP_PROP
> +
>   	engine->defaults = engine->props; /* never to change again */
>   
>   	engine->context_size = intel_engine_context_size(gt, engine->class);
> @@ -421,6 +441,55 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
>   	return 0;
>   }
>   
> +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs *engine, u64 value)
> +{
> +	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
> +
> +	return value;
> +}
> +
> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs *engine, u64 value)
> +{
> +	value = min(value, jiffies_to_nsecs(2));
> +
> +	return value;
> +}
> +
> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, u64 value)
> +{
> +	/*
> +	 * NB: The GuC API only supports 32bit values. However, the limit is further
> +	 * reduced due to internal calculations which would otherwise overflow.
> +	 */
> +	if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
> +		value = min_t(u64, value, GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
> +
> +	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
> +
> +	return value;
> +}
> +
> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 value)
> +{
> +	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
> +
> +	return value;
> +}
> +
> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value)
> +{
> +	/*
> +	 * NB: The GuC API only supports 32bit values. However, the limit is further
> +	 * reduced due to internal calculations which would otherwise overflow.
> +	 */
> +	if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
> +		value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);
> +
> +	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
> +
> +	return value;
> +}
> +
>   static void __setup_engine_capabilities(struct intel_engine_cs *engine)
>   {
>   	struct drm_i915_private *i915 = engine->i915;
> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c b/drivers/gpu/drm/i915/gt/sysfs_engines.c
> index 967031056202..f2d9858d827c 100644
> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
> @@ -144,7 +144,7 @@ max_spin_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	       const char *buf, size_t count)
>   {
>   	struct intel_engine_cs *engine = kobj_to_engine(kobj);
> -	unsigned long long duration;
> +	unsigned long long duration, clamped;
>   	int err;
>   
>   	/*
> @@ -168,7 +168,8 @@ max_spin_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	if (err)
>   		return err;
>   
> -	if (duration > jiffies_to_nsecs(2))
> +	clamped = intel_clamp_max_busywait_duration_ns(engine, duration);
> +	if (duration != clamped)
>   		return -EINVAL;
>   
>   	WRITE_ONCE(engine->props.max_busywait_duration_ns, duration);
> @@ -203,7 +204,7 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
>   		const char *buf, size_t count)
>   {
>   	struct intel_engine_cs *engine = kobj_to_engine(kobj);
> -	unsigned long long duration;
> +	unsigned long long duration, clamped;
>   	int err;
>   
>   	/*
> @@ -218,7 +219,8 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	if (err)
>   		return err;
>   
> -	if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
> +	clamped = intel_clamp_timeslice_duration_ms(engine, duration);
> +	if (duration != clamped)
>   		return -EINVAL;
>   
>   	WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
> @@ -256,7 +258,7 @@ stop_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	   const char *buf, size_t count)
>   {
>   	struct intel_engine_cs *engine = kobj_to_engine(kobj);
> -	unsigned long long duration;
> +	unsigned long long duration, clamped;
>   	int err;
>   
>   	/*
> @@ -272,7 +274,8 @@ stop_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	if (err)
>   		return err;
>   
> -	if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
> +	clamped = intel_clamp_stop_timeout_ms(engine, duration);
> +	if (duration != clamped)
>   		return -EINVAL;
>   
>   	WRITE_ONCE(engine->props.stop_timeout_ms, duration);
> @@ -306,7 +309,7 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
>   		      const char *buf, size_t count)
>   {
>   	struct intel_engine_cs *engine = kobj_to_engine(kobj);
> -	unsigned long long timeout;
> +	unsigned long long timeout, clamped;
>   	int err;
>   
>   	/*
> @@ -322,7 +325,8 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	if (err)
>   		return err;
>   
> -	if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
> +	clamped = intel_clamp_preempt_timeout_ms(engine, timeout);
> +	if (timeout != clamped)
>   		return -EINVAL;
>   
>   	WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
> @@ -362,7 +366,7 @@ heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
>   		const char *buf, size_t count)
>   {
>   	struct intel_engine_cs *engine = kobj_to_engine(kobj);
> -	unsigned long long delay;
> +	unsigned long long delay, clamped;
>   	int err;
>   
>   	/*
> @@ -379,7 +383,8 @@ heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
>   	if (err)
>   		return err;
>   
> -	if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
> +	clamped = intel_clamp_heartbeat_interval_ms(engine, delay);
> +	if (delay != clamped)
>   		return -EINVAL;
>   
>   	err = intel_engine_set_heartbeat(engine, delay);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> index 6a4612a852e2..ad131092f8df 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>   
>   #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>   
> +/*
> + * GuC converts the timeout to clock ticks internally. Different platforms have
> + * different GuC clocks. Thus, the maximum value before overflow is platform
> + * dependent. Current worst case scenario is about 110s. So, limit to 100s to be
> + * safe.
> + */
> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS		(100 * 1000)
> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS	(100 * 1000)
> +
>   struct guc_policies {
>   	u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>   	/* In micro seconds. How much time to allow before DPC processing is

LGTM. Pretty please:

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b3a429a92c0d..8208164c25e7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -2218,13 +2218,24 @@ static inline u32 get_children_join_value(struct intel_context *ce,
  static void guc_context_policy_init(struct intel_engine_cs *engine,
                                     struct guc_lrc_desc *desc)
  {
+       struct drm_device *drm = &engine->i915->drm;
+
         desc->policy_flags = 0;
  
         if (engine->flags & I915_ENGINE_WANT_FORCED_PREEMPTION)
                 desc->policy_flags |= CONTEXT_POLICY_FLAG_PREEMPT_TO_IDLE;
  
         /* NB: For both of these, zero means disabled. */
+       if (overflows_type(engine->props.timeslice_duration_ms * 1000,
+                          desc->execution_quantum))
+               drm_warn_once(drm, "GuC interface cannot support %lums timeslice!\n",
+                             engine->props.timeslice_duration_ms);
         desc->execution_quantum = engine->props.timeslice_duration_ms * 1000;
+
+       if (overflows_type(engine->props.preempt_timeout_ms * 1000,
+                          desc->preemption_timeout))
+               drm_warn_once(drm, "GuC interface cannot support %lums preemption timeout!\n",
+                             engine->props.preempt_timeout_ms);
         desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
  }
  

With that:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Any idea what happened with the CI run? It's full of odd failures.

Regards,

Tvrtko

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for Improve anti-pre-emption w/a for compute workloads (rev3)
  2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
                   ` (6 preceding siblings ...)
  (?)
@ 2022-03-02 11:16 ` Patchwork
  -1 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2022-03-02 11:16 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

== Series Details ==

Series: Improve anti-pre-emption w/a for compute workloads (rev3)
URL   : https://patchwork.freedesktop.org/series/100428/
State : failure

== Summary ==

Applying: drm/i915/guc: Limit scheduling properties to avoid overflow
error: patch failed: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c:2218
error: drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c: patch does not apply
error: Did you hand edit your patch?
It does not apply to blobs recorded in its index.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
Patch failed at 0001 drm/i915/guc: Limit scheduling properties to avoid overflow
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
  2022-03-02  9:49   ` Tvrtko Ursulin
@ 2022-03-02 18:22     ` John Harrison
  2022-03-03 10:06       ` Tvrtko Ursulin
  0 siblings, 1 reply; 15+ messages in thread
From: John Harrison @ 2022-03-02 18:22 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 3/2/2022 01:49, Tvrtko Ursulin wrote:
> On 25/02/2022 20:41, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> GuC converts the pre-emption timeout and timeslice quantum values into
>> clock ticks internally. That significantly reduces the point of 32bit
>> overflow. On current platforms, worst case scenario is approximately
>> 110 seconds. Rather than allowing the user to set higher values and
>> then get confused by early timeouts, add limits when setting these
>> values.
>>
>> v2: Add helper functions for clamping (review feedback from Tvrtko).
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> 
>> (v1)
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine.h      |  6 ++
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 69 +++++++++++++++++++++
>>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 25 +++++---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++
>>   4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
>> b/drivers/gpu/drm/i915/gt/intel_engine.h
>> index be4b1e65442f..5a9186f784c4 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
>> @@ -349,4 +349,10 @@ intel_engine_get_hung_context(struct 
>> intel_engine_cs *engine)
>>       return engine->hung_ce;
>>   }
>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>> *engine, u64 value);
>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>> *engine, u64 value);
>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>> u64 value);
>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>> value);
>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>> *engine, u64 value);
>> +
>>   #endif /* _INTEL_RINGBUFFER_H_ */
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index e855c801ba28..7ad9e6006656 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -399,6 +399,26 @@ static int intel_engine_setup(struct intel_gt 
>> *gt, enum intel_engine_id id,
>>       if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>>           engine->props.preempt_timeout_ms = 0;
>>   +    /* Cap properties according to any system limits */
>> +#define CLAMP_PROP(field) \
>> +    do { \
>> +        u64 clamp = intel_clamp_##field(engine, engine->props.field); \
>> +        if (clamp != engine->props.field) { \
>> +            drm_notice(&engine->i915->drm, \
>> +                   "Warning, clamping %s to %lld to prevent 
>> overflow\n", \
>> +                   #field, clamp); \
>> +            engine->props.field = clamp; \
>> +        } \
>> +    } while (0)
>> +
>> +    CLAMP_PROP(heartbeat_interval_ms);
>> +    CLAMP_PROP(max_busywait_duration_ns);
>> +    CLAMP_PROP(preempt_timeout_ms);
>> +    CLAMP_PROP(stop_timeout_ms);
>> +    CLAMP_PROP(timeslice_duration_ms);
>> +
>> +#undef CLAMP_PROP
>> +
>>       engine->defaults = engine->props; /* never to change again */
>>         engine->context_size = intel_engine_context_size(gt, 
>> engine->class);
>> @@ -421,6 +441,55 @@ static int intel_engine_setup(struct intel_gt 
>> *gt, enum intel_engine_id id,
>>       return 0;
>>   }
>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    value = min(value, jiffies_to_nsecs(2));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>> u64 value)
>> +{
>> +    /*
>> +     * NB: The GuC API only supports 32bit values. However, the 
>> limit is further
>> +     * reduced due to internal calculations which would otherwise 
>> overflow.
>> +     */
>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>> +        value = min_t(u64, value, GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>> +
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>> value)
>> +{
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    /*
>> +     * NB: The GuC API only supports 32bit values. However, the 
>> limit is further
>> +     * reduced due to internal calculations which would otherwise 
>> overflow.
>> +     */
>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>> +        value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>> +
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>>   static void __setup_engine_capabilities(struct intel_engine_cs 
>> *engine)
>>   {
>>       struct drm_i915_private *i915 = engine->i915;
>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c 
>> b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> index 967031056202..f2d9858d827c 100644
>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> @@ -144,7 +144,7 @@ max_spin_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>              const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -168,7 +168,8 @@ max_spin_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_nsecs(2))
>> +    clamped = intel_clamp_max_busywait_duration_ns(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.max_busywait_duration_ns, duration);
>> @@ -203,7 +204,7 @@ timeslice_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>           const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -218,7 +219,8 @@ timeslice_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_timeslice_duration_ms(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>> @@ -256,7 +258,7 @@ stop_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>          const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -272,7 +274,8 @@ stop_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_stop_timeout_ms(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.stop_timeout_ms, duration);
>> @@ -306,7 +309,7 @@ preempt_timeout_store(struct kobject *kobj, 
>> struct kobj_attribute *attr,
>>                 const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long timeout;
>> +    unsigned long long timeout, clamped;
>>       int err;
>>         /*
>> @@ -322,7 +325,8 @@ preempt_timeout_store(struct kobject *kobj, 
>> struct kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_preempt_timeout_ms(engine, timeout);
>> +    if (timeout != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>> @@ -362,7 +366,7 @@ heartbeat_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>           const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long delay;
>> +    unsigned long long delay, clamped;
>>       int err;
>>         /*
>> @@ -379,7 +383,8 @@ heartbeat_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_heartbeat_interval_ms(engine, delay);
>> +    if (delay != clamped)
>>           return -EINVAL;
>>         err = intel_engine_set_heartbeat(engine, delay);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index 6a4612a852e2..ad131092f8df 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>>     #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>>   +/*
>> + * GuC converts the timeout to clock ticks internally. Different 
>> platforms have
>> + * different GuC clocks. Thus, the maximum value before overflow is 
>> platform
>> + * dependent. Current worst case scenario is about 110s. So, limit 
>> to 100s to be
>> + * safe.
>> + */
>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS        (100 * 1000)
>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS    (100 * 1000)
>> +
>>   struct guc_policies {
>>       u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>       /* In micro seconds. How much time to allow before DPC 
>> processing is
>
> LGTM. Pretty please:
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b3a429a92c0d..8208164c25e7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2218,13 +2218,24 @@ static inline u32 get_children_join_value(struct intel_context *ce,
>  static void guc_context_policy_init(struct intel_engine_cs *engine,
>                                     struct guc_lrc_desc *desc)
>  {
> +       struct drm_device *drm = &engine->i915->drm;
> +
>         desc->policy_flags = 0;
>
>         if (engine->flags & I915_ENGINE_WANT_FORCED_PREEMPTION)
>                 desc->policy_flags |= CONTEXT_POLICY_FLAG_PREEMPT_TO_IDLE;
>
>         /* NB: For both of these, zero means disabled. */
> +       if (overflows_type(engine->props.timeslice_duration_ms * 1000,
> +                          desc->execution_quantum))
> +               drm_warn_once(drm, "GuC interface cannot support %lums timeslice!\n",
> +                             engine->props.timeslice_duration_ms);
>         desc->execution_quantum = engine->props.timeslice_duration_ms * 1000;
> +
> +       if (overflows_type(engine->props.preempt_timeout_ms * 1000,
> +                          desc->preemption_timeout))
> +               drm_warn_once(drm, "GuC interface cannot support %lums preemption timeout!\n",
> +                             engine->props.preempt_timeout_ms);
>         desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
>  }
>
As per comments in the other thread, this is redundant code and is too 
late in the chain to do anything. If such a check is present then it 
should be a BUG, not a drm_warn. The whole point of the helper functions 
is to consolidate all limits in one place, at the point where it is still 
possible to correct for problems.
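
For reference, the arithmetic the clamp already guarantees. A purely 
illustrative compile-time check (not part of the patch; U32_MAX standing 
in for the 32bit microsecond fields in the GuC descriptor):

    /* GUC_POLICY_MAX_EXEC_QUANTUM_MS = 100 * 1000 ms -> 100,000,000 us */
    static_assert((u64)GUC_POLICY_MAX_EXEC_QUANTUM_MS * 1000 <= U32_MAX);
    static_assert((u64)GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS * 1000 <= U32_MAX);

100,000,000us is well inside the ~4.29 billion limit of a u32, so once 
the clamp helpers have run, the later multiplication by 1000 cannot 
overflow the descriptor fields.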

> With that:
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Any idea what happened with the CI run? It's full of odd failures.
Not sure where 'rev3' came from; it's the 'rev2' code and I don't recall 
hitting the 'retest' button. Either way, that version just hit a merge 
conflict and didn't build. The previous two revs seemed to be hitting 
'failed to reset' problems in an execlist selftest on execlist-only 
platforms. I'm not seeing how that could be related. The changes should 
only kick in on Gen12 platforms and/or with GuC submission, but the 
failures are all on earlier platforms. I'm planning to grab a SKL or 
something to test on to make sure; I just haven't had the time to do so yet.

John.


>
> Regards,
>
> Tvrtko


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
  2022-03-02 18:22     ` John Harrison
@ 2022-03-03 10:06       ` Tvrtko Ursulin
  0 siblings, 0 replies; 15+ messages in thread
From: Tvrtko Ursulin @ 2022-03-03 10:06 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 02/03/2022 18:22, John Harrison wrote:
> On 3/2/2022 01:49, Tvrtko Ursulin wrote:
>> On 25/02/2022 20:41, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> GuC converts the pre-emption timeout and timeslice quantum values into
>>> clock ticks internally. That significantly reduces the point of 32bit
>>> overflow. On current platforms, worst case scenario is approximately
>>> 110 seconds. Rather than allowing the user to set higher values and
>>> then get confused by early timeouts, add limits when setting these
>>> values.
>>>
>>> v2: Add helper functions for clamping (review feedback from Tvrtko).
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> 
>>> (v1)
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_engine.h      |  6 ++
>>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 69 +++++++++++++++++++++
>>>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 25 +++++---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++
>>>   4 files changed, 99 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
>>> b/drivers/gpu/drm/i915/gt/intel_engine.h
>>> index be4b1e65442f..5a9186f784c4 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
>>> @@ -349,4 +349,10 @@ intel_engine_get_hung_context(struct 
>>> intel_engine_cs *engine)
>>>       return engine->hung_ce;
>>>   }
>>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>>> *engine, u64 value);
>>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>>> *engine, u64 value);
>>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>>> u64 value);
>>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>>> value);
>>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>>> *engine, u64 value);
>>> +
>>>   #endif /* _INTEL_RINGBUFFER_H_ */
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> index e855c801ba28..7ad9e6006656 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> @@ -399,6 +399,26 @@ static int intel_engine_setup(struct intel_gt 
>>> *gt, enum intel_engine_id id,
>>>       if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>>>           engine->props.preempt_timeout_ms = 0;
>>>   +    /* Cap properties according to any system limits */
>>> +#define CLAMP_PROP(field) \
>>> +    do { \
>>> +        u64 clamp = intel_clamp_##field(engine, engine->props.field); \
>>> +        if (clamp != engine->props.field) { \
>>> +            drm_notice(&engine->i915->drm, \
>>> +                   "Warning, clamping %s to %lld to prevent 
>>> overflow\n", \
>>> +                   #field, clamp); \
>>> +            engine->props.field = clamp; \
>>> +        } \
>>> +    } while (0)
>>> +
>>> +    CLAMP_PROP(heartbeat_interval_ms);
>>> +    CLAMP_PROP(max_busywait_duration_ns);
>>> +    CLAMP_PROP(preempt_timeout_ms);
>>> +    CLAMP_PROP(stop_timeout_ms);
>>> +    CLAMP_PROP(timeslice_duration_ms);
>>> +
>>> +#undef CLAMP_PROP
>>> +
>>>       engine->defaults = engine->props; /* never to change again */
>>>         engine->context_size = intel_engine_context_size(gt, 
>>> engine->class);
>>> @@ -421,6 +441,55 @@ static int intel_engine_setup(struct intel_gt 
>>> *gt, enum intel_engine_id id,
>>>       return 0;
>>>   }
>>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>>> *engine, u64 value)
>>> +{
>>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>>> +
>>> +    return value;
>>> +}
>>> +
>>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>>> *engine, u64 value)
>>> +{
>>> +    value = min(value, jiffies_to_nsecs(2));
>>> +
>>> +    return value;
>>> +}
>>> +
>>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>>> u64 value)
>>> +{
>>> +    /*
>>> +     * NB: The GuC API only supports 32bit values. However, the 
>>> limit is further
>>> +     * reduced due to internal calculations which would otherwise 
>>> overflow.
>>> +     */
>>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>>> +        value = min_t(u64, value, GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>>> +
>>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>>> +
>>> +    return value;
>>> +}
>>> +
>>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>>> value)
>>> +{
>>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>>> +
>>> +    return value;
>>> +}
>>> +
>>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>>> *engine, u64 value)
>>> +{
>>> +    /*
>>> +     * NB: The GuC API only supports 32bit values. However, the 
>>> limit is further
>>> +     * reduced due to internal calculations which would otherwise 
>>> overflow.
>>> +     */
>>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>>> +        value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>>> +
>>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>>> +
>>> +    return value;
>>> +}
>>> +
>>>   static void __setup_engine_capabilities(struct intel_engine_cs 
>>> *engine)
>>>   {
>>>       struct drm_i915_private *i915 = engine->i915;
>>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c 
>>> b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>>> index 967031056202..f2d9858d827c 100644
>>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>>> @@ -144,7 +144,7 @@ max_spin_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>              const char *buf, size_t count)
>>>   {
>>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>>> -    unsigned long long duration;
>>> +    unsigned long long duration, clamped;
>>>       int err;
>>>         /*
>>> @@ -168,7 +168,8 @@ max_spin_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>       if (err)
>>>           return err;
>>>   -    if (duration > jiffies_to_nsecs(2))
>>> +    clamped = intel_clamp_max_busywait_duration_ns(engine, duration);
>>> +    if (duration != clamped)
>>>           return -EINVAL;
>>>         WRITE_ONCE(engine->props.max_busywait_duration_ns, duration);
>>> @@ -203,7 +204,7 @@ timeslice_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>           const char *buf, size_t count)
>>>   {
>>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>>> -    unsigned long long duration;
>>> +    unsigned long long duration, clamped;
>>>       int err;
>>>         /*
>>> @@ -218,7 +219,8 @@ timeslice_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>       if (err)
>>>           return err;
>>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>> +    clamped = intel_clamp_timeslice_duration_ms(engine, duration);
>>> +    if (duration != clamped)
>>>           return -EINVAL;
>>>         WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>>> @@ -256,7 +258,7 @@ stop_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>          const char *buf, size_t count)
>>>   {
>>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>>> -    unsigned long long duration;
>>> +    unsigned long long duration, clamped;
>>>       int err;
>>>         /*
>>> @@ -272,7 +274,8 @@ stop_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>       if (err)
>>>           return err;
>>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>> +    clamped = intel_clamp_stop_timeout_ms(engine, duration);
>>> +    if (duration != clamped)
>>>           return -EINVAL;
>>>         WRITE_ONCE(engine->props.stop_timeout_ms, duration);
>>> @@ -306,7 +309,7 @@ preempt_timeout_store(struct kobject *kobj, 
>>> struct kobj_attribute *attr,
>>>                 const char *buf, size_t count)
>>>   {
>>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>>> -    unsigned long long timeout;
>>> +    unsigned long long timeout, clamped;
>>>       int err;
>>>         /*
>>> @@ -322,7 +325,8 @@ preempt_timeout_store(struct kobject *kobj, 
>>> struct kobj_attribute *attr,
>>>       if (err)
>>>           return err;
>>>   -    if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>> +    clamped = intel_clamp_preempt_timeout_ms(engine, timeout);
>>> +    if (timeout != clamped)
>>>           return -EINVAL;
>>>         WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>>> @@ -362,7 +366,7 @@ heartbeat_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>           const char *buf, size_t count)
>>>   {
>>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>>> -    unsigned long long delay;
>>> +    unsigned long long delay, clamped;
>>>       int err;
>>>         /*
>>> @@ -379,7 +383,8 @@ heartbeat_store(struct kobject *kobj, struct 
>>> kobj_attribute *attr,
>>>       if (err)
>>>           return err;
>>>   -    if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>> +    clamped = intel_clamp_heartbeat_interval_ms(engine, delay);
>>> +    if (delay != clamped)
>>>           return -EINVAL;
>>>         err = intel_engine_set_heartbeat(engine, delay);
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> index 6a4612a852e2..ad131092f8df 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>>>     #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>>>   +/*
>>> + * GuC converts the timeout to clock ticks internally. Different 
>>> platforms have
>>> + * different GuC clocks. Thus, the maximum value before overflow is 
>>> platform
>>> + * dependent. Current worst case scenario is about 110s. So, limit 
>>> to 100s to be
>>> + * safe.
>>> + */
>>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS        (100 * 1000)
>>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS    (100 * 1000)
>>> +
>>>   struct guc_policies {
>>>       u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>>       /* In micro seconds. How much time to allow before DPC 
>>> processing is
>>
>> LGTM. Pretty please:
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b3a429a92c0d..8208164c25e7 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -2218,13 +2218,24 @@ static inline u32 get_children_join_value(struct intel_context *ce,
>>  static void guc_context_policy_init(struct intel_engine_cs *engine,
>>                                     struct guc_lrc_desc *desc)
>>  {
>> +       struct drm_device *drm = &engine->i915->drm;
>> +
>>         desc->policy_flags = 0;
>>
>>         if (engine->flags & I915_ENGINE_WANT_FORCED_PREEMPTION)
>>                 desc->policy_flags |= CONTEXT_POLICY_FLAG_PREEMPT_TO_IDLE;
>>
>>         /* NB: For both of these, zero means disabled. */
>> +       if (overflows_type(engine->props.timeslice_duration_ms * 1000,
>> +                          desc->execution_quantum))
>> +               drm_warn_once(drm, "GuC interface cannot support %lums timeslice!\n",
>> +                             engine->props.timeslice_duration_ms);
>>         desc->execution_quantum = engine->props.timeslice_duration_ms * 1000;
>> +
>> +       if (overflows_type(engine->props.preempt_timeout_ms * 1000,
>> +                          desc->preemption_timeout))
>> +               drm_warn_once(drm, "GuC interface cannot support %lums preemption timeout!\n",
>> +                             engine->props.preempt_timeout_ms);
>>         desc->preemption_timeout = engine->props.preempt_timeout_ms * 1000;
>>  }
>>
> As per comments in the other thread, this is redundant code and is too 
> late in the chain to do anything. If such a check is present then it 
> should be a BUG, not a drm_warn. The whole point of the helper functions 
> is to consolidate all limits in one place, at the point where it is 
> still possible to correct for problems.

You could put this check in the clamp helpers. I'm not sure it would look 
amazing since it needs access to struct guc_lrc_desc, which is a much 
lower layer, but perhaps it would be passable since some implementation 
knowledge (MAX_SCHEDULE_TIMEOUT) is already there. Maybe by the time you 
multiply by 1000 to do the check it starts looking too ugly. Not sure. I 
would just stick with the above diff. The point is not whether it is too 
late or not; the point is just to have a marker which would let us know 
we forgot about something, if it comes to it. Could also be a 
drm_WARN_ON_ONCE, no strong preference from my side.
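
If you did want the marker living in the clamp helper instead, a rough, 
untested sketch against intel_engine_cs.c of what I mean (using U32_MAX 
directly rather than the real descriptor fields, and the 
drm_WARN_ON_ONCE variant mentioned above):

u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value)
{
	if (intel_guc_submission_is_wanted(&engine->gt->uc.guc)) {
		value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);

		/* Marker only; the clamp above means this should never fire. */
		drm_WARN_ON_ONCE(&engine->i915->drm, value * 1000 > U32_MAX);
	}

	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));

	return value;
}

The preempt timeout helper would get the same treatment with 
GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS.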

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-03-03 10:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2022-02-25 20:41 [PATCH v2 0/3] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-02-25 20:41 ` [Intel-gfx] " John.C.Harrison
2022-02-25 20:41 ` [PATCH v2 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-02-25 20:41   ` [Intel-gfx] " John.C.Harrison
2022-03-02  9:49   ` Tvrtko Ursulin
2022-03-02 18:22     ` John Harrison
2022-03-03 10:06       ` Tvrtko Ursulin
2022-02-25 20:41 ` [PATCH v2 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-02-25 20:41   ` [Intel-gfx] " John.C.Harrison
2022-02-25 20:41 ` [PATCH v2 3/3] drm/i915: Improve long running OCL w/a for GuC submission John.C.Harrison
2022-02-25 20:41   ` [Intel-gfx] " John.C.Harrison
2022-02-26  1:04 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Improve anti-pre-emption w/a for compute workloads (rev2) Patchwork
2022-02-26  1:05 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-02-26  1:45 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-03-02 11:16 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Improve anti-pre-emption w/a for compute workloads (rev3) Patchwork
