All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/7] Default request/fence expiry + watchdog
@ 2021-03-24 12:13 ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

"Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
post of a somewhat controversial feature, now upgraded to patch status.

I quote the "watchdog" becuase in classical sense watchdog would allow userspace
to ping it and so remain alive.

I quote "restoring hangcheck" because this series, contrary to the old
hangcheck, is not looking at whether the workload is making any progress from
the kernel side either. (Although disclaimer my memory may be leaky - Daniel
suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
But apart from being prone to both false negatives and false positives I can't
remember that myself.)

Short version - ask is to fail any user submissions after a set time period. In
this RFC that time is twelve seconds.

Time counts from the moment user submission is "runnable" (implicit and explicit
dependencies have been cleared) and keeps counting regardless of the GPU
contetion caused by other users of the system.

So semantics are really a bit weak, but again, I understand this is really
really wanted by the DRM core even if I am not convinced it is a good idea.

There are some dangers with doing this - text borrowed from a patch in the
series:

  This can have an effect that workloads which used to work fine will
  suddenly start failing. Even workloads comprised of short batches but in
  long dependency chains can be terminated.

  And becuase of lack of agreement on usefulness and safety of fence error
  propagation this partial execution can be invisible to userspace even if
  it is "listening" to returned fence status.

  Another interaction is with hangcheck where care needs to be taken timeout
  is not set lower or close to three times the heartbeat interval. Otherwise
  a hang in any application can cause complete termination of all
  submissions from unrelated clients. Any users modifying the per engine
  heartbeat intervals therefore need to be aware of this potential denial of
  service to avoid inadvertently enabling it.

  Given all this I am personally not convinced the scheme is a good idea.
  Intuitively it feels object importers would be better positioned to
  enforce the time they are willing to wait for something to complete.

v2:
 * Dropped context param.
 * Improved commit messages and Kconfig text.

v3:
 * Log timeouts.
 * Bump timeout to 20s to see if it helps Tigerlake.
 * Fix sentinel assert.

v4:
 * A round of review feedback applied.

Chris Wilson (1):
  drm/i915: Individual request cancellation

Tvrtko Ursulin (6):
  drm/i915: Extract active lookup engine to a helper
  drm/i915: Restrict sentinel requests further
  drm/i915: Handle async cancellation in sentinel assert
  drm/i915: Request watchdog infrastructure
  drm/i915: Fail too long user submissions by default
  drm/i915: Allow configuring default request expiry via modparam

 drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
 drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
 .../drm/i915/gt/intel_execlists_submission.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
 drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
 drivers/gpu/drm/i915/i915_params.c            |   5 +
 drivers/gpu/drm/i915/i915_params.h            |   1 +
 drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
 drivers/gpu/drm/i915/i915_request.h           |  16 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 17 files changed, 479 insertions(+), 45 deletions(-)

-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
@ 2021-03-24 12:13 ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

"Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
post of a somewhat controversial feature, now upgraded to patch status.

I quote the "watchdog" becuase in classical sense watchdog would allow userspace
to ping it and so remain alive.

I quote "restoring hangcheck" because this series, contrary to the old
hangcheck, is not looking at whether the workload is making any progress from
the kernel side either. (Although disclaimer my memory may be leaky - Daniel
suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
But apart from being prone to both false negatives and false positives I can't
remember that myself.)

Short version - ask is to fail any user submissions after a set time period. In
this RFC that time is twelve seconds.

Time counts from the moment user submission is "runnable" (implicit and explicit
dependencies have been cleared) and keeps counting regardless of the GPU
contetion caused by other users of the system.

So semantics are really a bit weak, but again, I understand this is really
really wanted by the DRM core even if I am not convinced it is a good idea.

There are some dangers with doing this - text borrowed from a patch in the
series:

  This can have an effect that workloads which used to work fine will
  suddenly start failing. Even workloads comprised of short batches but in
  long dependency chains can be terminated.

  And becuase of lack of agreement on usefulness and safety of fence error
  propagation this partial execution can be invisible to userspace even if
  it is "listening" to returned fence status.

  Another interaction is with hangcheck where care needs to be taken timeout
  is not set lower or close to three times the heartbeat interval. Otherwise
  a hang in any application can cause complete termination of all
  submissions from unrelated clients. Any users modifying the per engine
  heartbeat intervals therefore need to be aware of this potential denial of
  service to avoid inadvertently enabling it.

  Given all this I am personally not convinced the scheme is a good idea.
  Intuitively it feels object importers would be better positioned to
  enforce the time they are willing to wait for something to complete.

v2:
 * Dropped context param.
 * Improved commit messages and Kconfig text.

v3:
 * Log timeouts.
 * Bump timeout to 20s to see if it helps Tigerlake.
 * Fix sentinel assert.

v4:
 * A round of review feedback applied.

Chris Wilson (1):
  drm/i915: Individual request cancellation

Tvrtko Ursulin (6):
  drm/i915: Extract active lookup engine to a helper
  drm/i915: Restrict sentinel requests further
  drm/i915: Handle async cancellation in sentinel assert
  drm/i915: Request watchdog infrastructure
  drm/i915: Fail too long user submissions by default
  drm/i915: Allow configuring default request expiry via modparam

 drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
 drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
 .../drm/i915/gt/intel_execlists_submission.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
 drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
 drivers/gpu/drm/i915/i915_params.c            |   5 +
 drivers/gpu/drm/i915/i915_params.h            |   1 +
 drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
 drivers/gpu/drm/i915/i915_request.h           |  16 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 17 files changed, 479 insertions(+), 45 deletions(-)

-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/7] drm/i915: Extract active lookup engine to a helper
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Move active engine lookup to exported i915_request_active_engine.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 34 +---------------
 drivers/gpu/drm/i915/i915_request.c         | 44 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.h         |  4 ++
 3 files changed, 49 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index ca37d93ef5e7..03a2f5f2a11f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -408,38 +408,6 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
 	return intel_engine_pulse(engine) == 0;
 }
 
-static bool
-__active_engine(struct i915_request *rq, struct intel_engine_cs **active)
-{
-	struct intel_engine_cs *engine, *locked;
-	bool ret = false;
-
-	/*
-	 * Serialise with __i915_request_submit() so that it sees
-	 * is-banned?, or we know the request is already inflight.
-	 *
-	 * Note that rq->engine is unstable, and so we double
-	 * check that we have acquired the lock on the final engine.
-	 */
-	locked = READ_ONCE(rq->engine);
-	spin_lock_irq(&locked->sched.lock);
-	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
-		spin_unlock(&locked->sched.lock);
-		locked = engine;
-		spin_lock(&locked->sched.lock);
-	}
-
-	if (i915_request_is_active(rq)) {
-		if (!__i915_request_is_complete(rq))
-			*active = locked;
-		ret = true;
-	}
-
-	spin_unlock_irq(&locked->sched.lock);
-
-	return ret;
-}
-
 static struct intel_engine_cs *active_engine(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = NULL;
@@ -467,7 +435,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 		/* Check with the backend if the request is inflight */
 		found = true;
 		if (likely(rcu_access_pointer(rq->timeline) == ce->timeline))
-			found = __active_engine(rq, &engine);
+			found = i915_request_active_engine(rq, &engine);
 
 		i915_request_put(rq);
 		if (found)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e7b4c4bc41a6..8416b0bc4eb3 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -244,6 +244,50 @@ static void __i915_request_fill(struct i915_request *rq, u8 val)
 	memset(vaddr + head, val, rq->postfix - head);
 }
 
+/**
+ * i915_request_active_engine
+ * @rq: request to inspect
+ * @active: pointer in which to return the active engine
+ *
+ * Fills the currently active engine to the @active pointer if the request
+ * is active and still not completed.
+ *
+ * Returns true if request was active or false otherwise.
+ */
+bool
+i915_request_active_engine(struct i915_request *rq,
+			   struct intel_engine_cs **active)
+{
+	struct intel_engine_cs *engine, *locked;
+	bool ret = false;
+
+	/*
+	 * Serialise with __i915_request_submit() so that it sees
+	 * is-banned?, or we know the request is already inflight.
+	 *
+	 * Note that rq->engine is unstable, and so we double
+	 * check that we have acquired the lock on the final engine.
+	 */
+	locked = READ_ONCE(rq->engine);
+	spin_lock_irq(&locked->sched.lock);
+	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
+		spin_unlock(&locked->sched.lock);
+		locked = engine;
+		spin_lock(&locked->sched.lock);
+	}
+
+	if (i915_request_is_active(rq)) {
+		if (!__i915_request_is_complete(rq))
+			*active = locked;
+		ret = true;
+	}
+
+	spin_unlock_irq(&locked->sched.lock);
+
+	return ret;
+}
+
+
 static void remove_from_engine(struct i915_request *rq)
 {
 	struct intel_engine_cs *engine, *locked;
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index dd10a6db3d21..f5374bab7e69 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -650,4 +650,8 @@ static inline bool i915_request_use_semaphores(const struct i915_request *rq)
 	return intel_engine_has_semaphores(rq->engine);
 }
 
+bool
+i915_request_active_engine(struct i915_request *rq,
+			   struct intel_engine_cs **active);
+
 #endif /* I915_REQUEST_H */
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 1/7] drm/i915: Extract active lookup engine to a helper
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Move active engine lookup to exported i915_request_active_engine.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 34 +---------------
 drivers/gpu/drm/i915/i915_request.c         | 44 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.h         |  4 ++
 3 files changed, 49 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index ca37d93ef5e7..03a2f5f2a11f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -408,38 +408,6 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
 	return intel_engine_pulse(engine) == 0;
 }
 
-static bool
-__active_engine(struct i915_request *rq, struct intel_engine_cs **active)
-{
-	struct intel_engine_cs *engine, *locked;
-	bool ret = false;
-
-	/*
-	 * Serialise with __i915_request_submit() so that it sees
-	 * is-banned?, or we know the request is already inflight.
-	 *
-	 * Note that rq->engine is unstable, and so we double
-	 * check that we have acquired the lock on the final engine.
-	 */
-	locked = READ_ONCE(rq->engine);
-	spin_lock_irq(&locked->sched.lock);
-	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
-		spin_unlock(&locked->sched.lock);
-		locked = engine;
-		spin_lock(&locked->sched.lock);
-	}
-
-	if (i915_request_is_active(rq)) {
-		if (!__i915_request_is_complete(rq))
-			*active = locked;
-		ret = true;
-	}
-
-	spin_unlock_irq(&locked->sched.lock);
-
-	return ret;
-}
-
 static struct intel_engine_cs *active_engine(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = NULL;
@@ -467,7 +435,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 		/* Check with the backend if the request is inflight */
 		found = true;
 		if (likely(rcu_access_pointer(rq->timeline) == ce->timeline))
-			found = __active_engine(rq, &engine);
+			found = i915_request_active_engine(rq, &engine);
 
 		i915_request_put(rq);
 		if (found)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index e7b4c4bc41a6..8416b0bc4eb3 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -244,6 +244,50 @@ static void __i915_request_fill(struct i915_request *rq, u8 val)
 	memset(vaddr + head, val, rq->postfix - head);
 }
 
+/**
+ * i915_request_active_engine
+ * @rq: request to inspect
+ * @active: pointer in which to return the active engine
+ *
+ * Fills the currently active engine to the @active pointer if the request
+ * is active and still not completed.
+ *
+ * Returns true if request was active or false otherwise.
+ */
+bool
+i915_request_active_engine(struct i915_request *rq,
+			   struct intel_engine_cs **active)
+{
+	struct intel_engine_cs *engine, *locked;
+	bool ret = false;
+
+	/*
+	 * Serialise with __i915_request_submit() so that it sees
+	 * is-banned?, or we know the request is already inflight.
+	 *
+	 * Note that rq->engine is unstable, and so we double
+	 * check that we have acquired the lock on the final engine.
+	 */
+	locked = READ_ONCE(rq->engine);
+	spin_lock_irq(&locked->sched.lock);
+	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
+		spin_unlock(&locked->sched.lock);
+		locked = engine;
+		spin_lock(&locked->sched.lock);
+	}
+
+	if (i915_request_is_active(rq)) {
+		if (!__i915_request_is_complete(rq))
+			*active = locked;
+		ret = true;
+	}
+
+	spin_unlock_irq(&locked->sched.lock);
+
+	return ret;
+}
+
+
 static void remove_from_engine(struct i915_request *rq)
 {
 	struct intel_engine_cs *engine, *locked;
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index dd10a6db3d21..f5374bab7e69 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -650,4 +650,8 @@ static inline bool i915_request_use_semaphores(const struct i915_request *rq)
 	return intel_engine_has_semaphores(rq->engine);
 }
 
+bool
+i915_request_active_engine(struct i915_request *rq,
+			   struct intel_engine_cs **active);
+
 #endif /* I915_REQUEST_H */
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/7] drm/i915: Individual request cancellation
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Tvrtko Ursulin, Matthew Auld, dri-devel, Chris Wilson

From: Chris Wilson <chris@chris-wilson.co.uk>

Currently, we cancel outstanding requests within a context when the
context is closed. We may also want to cancel individual requests using
the same graceful preemption mechanism.

v2 (Tvrtko):
 * Cancel waiters carefully considering no timeline lock and RCU.
 * Fixed selftests.

v3 (Tvrtko):
 * Remove error propagation to waiters for now.

v4 (Tvrtko):
 * Rebase for extracted i915_request_active_engine. (Matt)

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com> # v3
---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |   9 +-
 drivers/gpu/drm/i915/i915_request.c           |  33 ++-
 drivers/gpu/drm/i915/i915_request.h           |   4 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 5 files changed, 242 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 0b062fad1837..e2fb3ae2aaf3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -314,6 +314,7 @@ int intel_engine_pulse(struct intel_engine_cs *engine)
 		mutex_unlock(&ce->timeline->mutex);
 	}
 
+	intel_engine_flush_scheduler(engine);
 	intel_engine_pm_put(engine);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 85ff5fe861b4..4c2acb5a6c0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -421,6 +421,11 @@ static void reset_active(struct i915_request *rq,
 	ce->lrc.lrca = lrc_update_regs(ce, engine, head);
 }
 
+static bool bad_request(const struct i915_request *rq)
+{
+	return rq->fence.error && i915_request_started(rq);
+}
+
 static struct intel_engine_cs *
 __execlists_schedule_in(struct i915_request *rq)
 {
@@ -433,7 +438,7 @@ __execlists_schedule_in(struct i915_request *rq)
 		     !intel_engine_has_heartbeat(engine)))
 		intel_context_set_banned(ce);
 
-	if (unlikely(intel_context_is_banned(ce)))
+	if (unlikely(intel_context_is_banned(ce) || bad_request(rq)))
 		reset_active(rq, engine);
 
 	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
@@ -1112,7 +1117,7 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
 		return 0;
 
 	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
-	if (unlikely(intel_context_is_banned(rq->context)))
+	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
 		return 1;
 
 	return READ_ONCE(engine->props.preempt_timeout_ms);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 8416b0bc4eb3..d1a4a3fa7425 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -33,7 +33,10 @@
 #include "gem/i915_gem_context.h"
 #include "gt/intel_breadcrumbs.h"
 #include "gt/intel_context.h"
+#include "gt/intel_engine.h"
+#include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gpu_commands.h"
+#include "gt/intel_reset.h"
 #include "gt/intel_ring.h"
 #include "gt/intel_rps.h"
 
@@ -473,20 +476,22 @@ void __i915_request_skip(struct i915_request *rq)
 	rq->infix = rq->postfix;
 }
 
-void i915_request_set_error_once(struct i915_request *rq, int error)
+bool i915_request_set_error_once(struct i915_request *rq, int error)
 {
 	int old;
 
 	GEM_BUG_ON(!IS_ERR_VALUE((long)error));
 
 	if (i915_request_signaled(rq))
-		return;
+		return false;
 
 	old = READ_ONCE(rq->fence.error);
 	do {
 		if (fatal_error(old))
-			return;
+			return false;
 	} while (!try_cmpxchg(&rq->fence.error, &old, error));
+
+	return true;
 }
 
 struct i915_request *i915_request_mark_eio(struct i915_request *rq)
@@ -653,6 +658,28 @@ void i915_request_unsubmit(struct i915_request *request)
 	spin_unlock_irqrestore(&se->lock, flags);
 }
 
+static void __cancel_request(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = NULL;
+
+	i915_request_active_engine(rq, &engine);
+
+	if (engine && intel_engine_pulse(engine))
+		intel_gt_handle_error(engine->gt, engine->mask, 0,
+				      "request cancellation by %s",
+				      current->comm);
+}
+
+void i915_request_cancel(struct i915_request *rq, int error)
+{
+	if (!i915_request_set_error_once(rq, error))
+		return;
+
+	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
+
+	__cancel_request(rq);
+}
+
 static int __i915_sw_fence_call
 submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 {
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index f5374bab7e69..2dea55ea69e1 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -312,7 +312,7 @@ struct i915_request * __must_check
 i915_request_create(struct intel_context *ce);
 
 void __i915_request_skip(struct i915_request *rq);
-void i915_request_set_error_once(struct i915_request *rq, int error);
+bool i915_request_set_error_once(struct i915_request *rq, int error);
 struct i915_request *i915_request_mark_eio(struct i915_request *rq);
 
 struct i915_request *__i915_request_commit(struct i915_request *request);
@@ -368,6 +368,8 @@ void i915_request_submit(struct i915_request *request);
 void __i915_request_unsubmit(struct i915_request *request);
 void i915_request_unsubmit(struct i915_request *request);
 
+void i915_request_cancel(struct i915_request *rq, int error);
+
 long i915_request_wait(struct i915_request *rq,
 		       unsigned int flags,
 		       long timeout)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 8035ea7565ed..72f85d9d5e8b 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -608,6 +608,206 @@ static int live_nop_request(void *arg)
 	return err;
 }
 
+static int __cancel_inactive(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling inactive request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel inactive request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int __cancel_active(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel active request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int __cancel_completed(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+	igt_spinner_end(&spin);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	pr_debug("%s: Cancelling completed request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	if (rq->fence.error) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int live_cancel_request(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+
+	/*
+	 * Check cancellation of requests. We expect to be able to immediately
+	 * cancel active requests, even if they are currently on the GPU.
+	 */
+
+	for_each_uabi_engine(engine, i915) {
+		struct igt_live_test t;
+		int err, err2;
+
+		if (!intel_engine_has_preemption(engine))
+			continue;
+
+		err = igt_live_test_begin(&t, i915, __func__, engine->name);
+		if (err)
+			return err;
+
+		err = __cancel_inactive(engine);
+		if (err == 0)
+			err = __cancel_active(engine);
+		if (err == 0)
+			err = __cancel_completed(engine);
+
+		err2 = igt_live_test_end(&t);
+		if (err)
+			return err;
+		if (err2)
+			return err2;
+	}
+
+	return 0;
+}
+
 static struct i915_vma *empty_batch(struct drm_i915_private *i915)
 {
 	struct drm_i915_gem_object *obj;
@@ -1485,6 +1685,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(live_sequential_engines),
 		SUBTEST(live_parallel_engines),
 		SUBTEST(live_empty_request),
+		SUBTEST(live_cancel_request),
 		SUBTEST(live_breadcrumbs_smoketest),
 	};
 
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 2/7] drm/i915: Individual request cancellation
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Matthew Auld, dri-devel, Chris Wilson

From: Chris Wilson <chris@chris-wilson.co.uk>

Currently, we cancel outstanding requests within a context when the
context is closed. We may also want to cancel individual requests using
the same graceful preemption mechanism.

v2 (Tvrtko):
 * Cancel waiters carefully considering no timeline lock and RCU.
 * Fixed selftests.

v3 (Tvrtko):
 * Remove error propagation to waiters for now.

v4 (Tvrtko):
 * Rebase for extracted i915_request_active_engine. (Matt)

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com> # v3
---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |   9 +-
 drivers/gpu/drm/i915/i915_request.c           |  33 ++-
 drivers/gpu/drm/i915/i915_request.h           |   4 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 5 files changed, 242 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 0b062fad1837..e2fb3ae2aaf3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -314,6 +314,7 @@ int intel_engine_pulse(struct intel_engine_cs *engine)
 		mutex_unlock(&ce->timeline->mutex);
 	}
 
+	intel_engine_flush_scheduler(engine);
 	intel_engine_pm_put(engine);
 	return err;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 85ff5fe861b4..4c2acb5a6c0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -421,6 +421,11 @@ static void reset_active(struct i915_request *rq,
 	ce->lrc.lrca = lrc_update_regs(ce, engine, head);
 }
 
+static bool bad_request(const struct i915_request *rq)
+{
+	return rq->fence.error && i915_request_started(rq);
+}
+
 static struct intel_engine_cs *
 __execlists_schedule_in(struct i915_request *rq)
 {
@@ -433,7 +438,7 @@ __execlists_schedule_in(struct i915_request *rq)
 		     !intel_engine_has_heartbeat(engine)))
 		intel_context_set_banned(ce);
 
-	if (unlikely(intel_context_is_banned(ce)))
+	if (unlikely(intel_context_is_banned(ce) || bad_request(rq)))
 		reset_active(rq, engine);
 
 	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
@@ -1112,7 +1117,7 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
 		return 0;
 
 	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
-	if (unlikely(intel_context_is_banned(rq->context)))
+	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
 		return 1;
 
 	return READ_ONCE(engine->props.preempt_timeout_ms);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 8416b0bc4eb3..d1a4a3fa7425 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -33,7 +33,10 @@
 #include "gem/i915_gem_context.h"
 #include "gt/intel_breadcrumbs.h"
 #include "gt/intel_context.h"
+#include "gt/intel_engine.h"
+#include "gt/intel_engine_heartbeat.h"
 #include "gt/intel_gpu_commands.h"
+#include "gt/intel_reset.h"
 #include "gt/intel_ring.h"
 #include "gt/intel_rps.h"
 
@@ -473,20 +476,22 @@ void __i915_request_skip(struct i915_request *rq)
 	rq->infix = rq->postfix;
 }
 
-void i915_request_set_error_once(struct i915_request *rq, int error)
+bool i915_request_set_error_once(struct i915_request *rq, int error)
 {
 	int old;
 
 	GEM_BUG_ON(!IS_ERR_VALUE((long)error));
 
 	if (i915_request_signaled(rq))
-		return;
+		return false;
 
 	old = READ_ONCE(rq->fence.error);
 	do {
 		if (fatal_error(old))
-			return;
+			return false;
 	} while (!try_cmpxchg(&rq->fence.error, &old, error));
+
+	return true;
 }
 
 struct i915_request *i915_request_mark_eio(struct i915_request *rq)
@@ -653,6 +658,28 @@ void i915_request_unsubmit(struct i915_request *request)
 	spin_unlock_irqrestore(&se->lock, flags);
 }
 
+static void __cancel_request(struct i915_request *rq)
+{
+	struct intel_engine_cs *engine = NULL;
+
+	i915_request_active_engine(rq, &engine);
+
+	if (engine && intel_engine_pulse(engine))
+		intel_gt_handle_error(engine->gt, engine->mask, 0,
+				      "request cancellation by %s",
+				      current->comm);
+}
+
+void i915_request_cancel(struct i915_request *rq, int error)
+{
+	if (!i915_request_set_error_once(rq, error))
+		return;
+
+	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
+
+	__cancel_request(rq);
+}
+
 static int __i915_sw_fence_call
 submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 {
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index f5374bab7e69..2dea55ea69e1 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -312,7 +312,7 @@ struct i915_request * __must_check
 i915_request_create(struct intel_context *ce);
 
 void __i915_request_skip(struct i915_request *rq);
-void i915_request_set_error_once(struct i915_request *rq, int error);
+bool i915_request_set_error_once(struct i915_request *rq, int error);
 struct i915_request *i915_request_mark_eio(struct i915_request *rq);
 
 struct i915_request *__i915_request_commit(struct i915_request *request);
@@ -368,6 +368,8 @@ void i915_request_submit(struct i915_request *request);
 void __i915_request_unsubmit(struct i915_request *request);
 void i915_request_unsubmit(struct i915_request *request);
 
+void i915_request_cancel(struct i915_request *rq, int error);
+
 long i915_request_wait(struct i915_request *rq,
 		       unsigned int flags,
 		       long timeout)
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index 8035ea7565ed..72f85d9d5e8b 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -608,6 +608,206 @@ static int live_nop_request(void *arg)
 	return err;
 }
 
+static int __cancel_inactive(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling inactive request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel inactive request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int __cancel_active(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+
+	pr_debug("%s: Cancelling active request\n", engine->name);
+	i915_request_get(rq);
+	i915_request_add(rq);
+	if (!igt_wait_for_spinner(&spin, rq)) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("Failed to start spinner on %s\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+	i915_request_cancel(rq, -EINTR);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		struct drm_printer p = drm_info_printer(engine->i915->drm.dev);
+
+		pr_err("%s: Failed to cancel active request\n", engine->name);
+		intel_engine_dump(engine, &p, "%s\n", engine->name);
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	if (rq->fence.error != -EINTR) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int __cancel_completed(struct intel_engine_cs *engine)
+{
+	struct intel_context *ce;
+	struct igt_spinner spin;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (igt_spinner_init(&spin, engine->gt))
+		return -ENOMEM;
+
+	ce = intel_context_create(engine);
+	if (IS_ERR(ce)) {
+		err = PTR_ERR(ce);
+		goto out_spin;
+	}
+
+	rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto out_ce;
+	}
+	igt_spinner_end(&spin);
+	i915_request_get(rq);
+	i915_request_add(rq);
+
+	if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+		err = -ETIME;
+		goto out_rq;
+	}
+
+	pr_debug("%s: Cancelling completed request\n", engine->name);
+	i915_request_cancel(rq, -EINTR);
+	if (rq->fence.error) {
+		pr_err("%s: fence not cancelled (%u)\n",
+		       engine->name, rq->fence.error);
+		err = -EINVAL;
+	}
+
+out_rq:
+	i915_request_put(rq);
+out_ce:
+	intel_context_put(ce);
+out_spin:
+	igt_spinner_fini(&spin);
+	if (err)
+		pr_err("%s: %s error %d\n", __func__, engine->name, err);
+	return err;
+}
+
+static int live_cancel_request(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+
+	/*
+	 * Check cancellation of requests. We expect to be able to immediately
+	 * cancel active requests, even if they are currently on the GPU.
+	 */
+
+	for_each_uabi_engine(engine, i915) {
+		struct igt_live_test t;
+		int err, err2;
+
+		if (!intel_engine_has_preemption(engine))
+			continue;
+
+		err = igt_live_test_begin(&t, i915, __func__, engine->name);
+		if (err)
+			return err;
+
+		err = __cancel_inactive(engine);
+		if (err == 0)
+			err = __cancel_active(engine);
+		if (err == 0)
+			err = __cancel_completed(engine);
+
+		err2 = igt_live_test_end(&t);
+		if (err)
+			return err;
+		if (err2)
+			return err2;
+	}
+
+	return 0;
+}
+
 static struct i915_vma *empty_batch(struct drm_i915_private *i915)
 {
 	struct drm_i915_gem_object *obj;
@@ -1485,6 +1685,7 @@ int i915_request_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(live_sequential_engines),
 		SUBTEST(live_parallel_engines),
 		SUBTEST(live_empty_request),
+		SUBTEST(live_cancel_request),
 		SUBTEST(live_breadcrumbs_smoketest),
 	};
 
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/7] drm/i915: Restrict sentinel requests further
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Disallow sentinel requests follow previous sentinels to make request
cancellation work better when faced with a chain of requests which have
all been marked as in error.

Because in cases where we end up with a stream of cancelled requests we
want to turn of request coalescing so they each will get individually
skipped by the execlists_schedule_in (which is called per ELSP port, not
per request).

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4c2acb5a6c0a..4b870eca9693 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -896,7 +896,7 @@ static bool can_merge_rq(const struct i915_request *prev,
 	if (__i915_request_is_complete(next))
 		return true;
 
-	if (unlikely((i915_request_flags(prev) ^ i915_request_flags(next)) &
+	if (unlikely((i915_request_flags(prev) | i915_request_flags(next)) &
 		     (BIT(I915_FENCE_FLAG_NOPREEMPT) |
 		      BIT(I915_FENCE_FLAG_SENTINEL))))
 		return false;
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 3/7] drm/i915: Restrict sentinel requests further
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Disallow sentinel requests follow previous sentinels to make request
cancellation work better when faced with a chain of requests which have
all been marked as in error.

Because in cases where we end up with a stream of cancelled requests we
want to turn of request coalescing so they each will get individually
skipped by the execlists_schedule_in (which is called per ELSP port, not
per request).

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4c2acb5a6c0a..4b870eca9693 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -896,7 +896,7 @@ static bool can_merge_rq(const struct i915_request *prev,
 	if (__i915_request_is_complete(next))
 		return true;
 
-	if (unlikely((i915_request_flags(prev) ^ i915_request_flags(next)) &
+	if (unlikely((i915_request_flags(prev) | i915_request_flags(next)) &
 		     (BIT(I915_FENCE_FLAG_NOPREEMPT) |
 		      BIT(I915_FENCE_FLAG_SENTINEL))))
 		return false;
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

With the watchdog cancelling requests asynchronously to preempt-to-busy we
need to relax one assert making it apply only to requests not in error.

v2:
 * Check against the correct request!

v3:
 * Simplify the check to avoid the question of when to sample the fence
   error vs sentinel bit.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4b870eca9693..9d264d4ffa75 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -700,9 +700,8 @@ assert_pending_valid(const struct intel_engine_execlists *execlists,
 {
 	struct intel_engine_cs *engine =
 		container_of(execlists, typeof(*engine), execlists);
-	struct i915_request * const *port, *rq;
+	struct i915_request * const *port, *rq, *prev = NULL;
 	struct intel_context *ce = NULL;
-	bool sentinel = false;
 	u32 ccid = -1;
 
 	trace_ports(execlists, msg, execlists->pending);
@@ -752,15 +751,20 @@ assert_pending_valid(const struct intel_engine_execlists *execlists,
 		 * Sentinels are supposed to be the last request so they flush
 		 * the current execution off the HW. Check that they are the only
 		 * request in the pending submission.
+		 *
+		 * NB: Due to the async nature of preempt-to-busy and request
+		 * cancellation we need to handle the case where request
+		 * becomes a sentinel in parallel to CSB processing.
 		 */
-		if (sentinel) {
+		if (prev && i915_request_has_sentinel(prev) &&
+		    !READ_ONCE(prev->fence.error)) {
 			GEM_TRACE_ERR("%s: context:%llx after sentinel in pending[%zd]\n",
 				      engine->name,
 				      ce->timeline->fence_context,
 				      port - execlists->pending);
 			return false;
 		}
-		sentinel = i915_request_has_sentinel(rq);
+		prev = rq;
 
 		/*
 		 * We want virtual requests to only be in the first slot so
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

With the watchdog cancelling requests asynchronously to preempt-to-busy we
need to relax one assert making it apply only to requests not in error.

v2:
 * Check against the correct request!

v3:
 * Simplify the check to avoid the question of when to sample the fence
   error vs sentinel bit.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4b870eca9693..9d264d4ffa75 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -700,9 +700,8 @@ assert_pending_valid(const struct intel_engine_execlists *execlists,
 {
 	struct intel_engine_cs *engine =
 		container_of(execlists, typeof(*engine), execlists);
-	struct i915_request * const *port, *rq;
+	struct i915_request * const *port, *rq, *prev = NULL;
 	struct intel_context *ce = NULL;
-	bool sentinel = false;
 	u32 ccid = -1;
 
 	trace_ports(execlists, msg, execlists->pending);
@@ -752,15 +751,20 @@ assert_pending_valid(const struct intel_engine_execlists *execlists,
 		 * Sentinels are supposed to be the last request so they flush
 		 * the current execution off the HW. Check that they are the only
 		 * request in the pending submission.
+		 *
+		 * NB: Due to the async nature of preempt-to-busy and request
+		 * cancellation we need to handle the case where request
+		 * becomes a sentinel in parallel to CSB processing.
 		 */
-		if (sentinel) {
+		if (prev && i915_request_has_sentinel(prev) &&
+		    !READ_ONCE(prev->fence.error)) {
 			GEM_TRACE_ERR("%s: context:%llx after sentinel in pending[%zd]\n",
 				      engine->name,
 				      ce->timeline->fence_context,
 				      port - execlists->pending);
 			return false;
 		}
-		sentinel = i915_request_has_sentinel(rq);
+		prev = rq;
 
 		/*
 		 * We want virtual requests to only be in the first slot so
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/7] drm/i915: Request watchdog infrastructure
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, Matthew Auld, dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Prepares the plumbing for setting request/fence expiration time. All code
is put in place but is never activated due yet missing ability to actually
configure the timer.

Outline of the basic operation:

A timer is started when request is ready for execution. If the request
completes (retires) before the timer fires, timer is cancelled and nothing
further happens.

If the timer fires request is added to a lockless list and worker queued.
Purpose of this is twofold: a) It allows request cancellation from a more
friendly context and b) coalesces multiple expirations into a single event
of consuming the list.

Worker locklessly consumes the list of expired requests and cancels them
all using previous added i915_request_cancel().

Associated timeout value is stored in rq->context.watchdog.timeout_us.

v2:
 * Log expiration.

v3:
 * Include more information about user timeline in the log message.

v4:
 * Remove obsolete comment and fix formatting. (Matt)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  4 ++
 .../drm/i915/gt/intel_execlists_submission.h  |  2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
 drivers/gpu/drm/i915/gt/intel_gt.h            |  2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 28 ++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |  7 +++
 drivers/gpu/drm/i915/i915_request.c           | 52 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.h           |  8 +++
 8 files changed, 106 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 0ea18c9e2aca..65a5730a4f5b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -99,6 +99,10 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	u32 *lrc_reg_state;
 	union {
 		struct {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
index f7bd3fccfee8..4ca9b475e252 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
@@ -6,6 +6,7 @@
 #ifndef __INTEL_EXECLISTS_SUBMISSION_H__
 #define __INTEL_EXECLISTS_SUBMISSION_H__
 
+#include <linux/llist.h>
 #include <linux/types.h>
 
 struct drm_printer;
@@ -13,6 +14,7 @@ struct drm_printer;
 struct i915_request;
 struct intel_context;
 struct intel_engine_cs;
+struct intel_gt;
 
 enum {
 	INTEL_CONTEXT_SCHEDULE_IN = 0,
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index ca76f93bc03d..8d77dcbad059 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -31,6 +31,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
 	INIT_LIST_HEAD(&gt->closed_vma);
 	spin_lock_init(&gt->closed_lock);
 
+	init_llist_head(&gt->watchdog.list);
+	INIT_WORK(&gt->watchdog.work, intel_gt_watchdog_work);
+
 	intel_gt_init_buffer_pool(gt);
 	intel_gt_init_reset(gt);
 	intel_gt_init_requests(gt);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h
index a17bd8b3195f..7ec395cace69 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -78,4 +78,6 @@ static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
 void intel_gt_info_print(const struct intel_gt_info *info,
 			 struct drm_printer *p);
 
+void intel_gt_watchdog_work(struct work_struct *work);
+
 #endif /* __INTEL_GT_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 36ec97f79174..fbfd19b2e5f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -8,6 +8,7 @@
 #include "i915_drv.h" /* for_each_engine() */
 #include "i915_request.h"
 #include "intel_engine_heartbeat.h"
+#include "intel_execlists_submission.h"
 #include "intel_gt.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_requests.h"
@@ -242,4 +243,31 @@ void intel_gt_fini_requests(struct intel_gt *gt)
 {
 	/* Wait until the work is marked as finished before unloading! */
 	cancel_delayed_work_sync(&gt->requests.retire_work);
+
+	flush_work(&gt->watchdog.work);
+}
+
+void intel_gt_watchdog_work(struct work_struct *work)
+{
+	struct intel_gt *gt =
+		container_of(work, typeof(*gt), watchdog.work);
+	struct i915_request *rq, *rn;
+	struct llist_node *first;
+
+	first = llist_del_all(&gt->watchdog.list);
+	if (!first)
+		return;
+
+	llist_for_each_entry_safe(rq, rn, first, watchdog.link) {
+		if (!i915_request_completed(rq)) {
+			struct dma_fence *f = &rq->fence;
+
+			pr_notice("Fence expiration time out i915-%s:%s:%llx!\n",
+				  f->ops->get_driver_name(f),
+				  f->ops->get_timeline_name(f),
+				  f->seqno);
+			i915_request_cancel(rq, -EINTR);
+		}
+		i915_request_put(rq);
+	}
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 626af37c7790..d70ebcc6f19f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -8,10 +8,12 @@
 
 #include <linux/ktime.h>
 #include <linux/list.h>
+#include <linux/llist.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
+#include <linux/workqueue.h>
 
 #include "uc/intel_uc.h"
 
@@ -62,6 +64,11 @@ struct intel_gt {
 		struct delayed_work retire_work;
 	} requests;
 
+	struct {
+		struct llist_head list;
+		struct work_struct work;
+	} watchdog;
+
 	struct intel_wakeref wakeref;
 	atomic_t user_wakeref;
 
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index d1a4a3fa7425..d58052f3410c 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -321,6 +321,53 @@ static void remove_from_engine(struct i915_request *rq)
 	__notify_execute_cb_imm(rq);
 }
 
+static void __rq_init_watchdog(struct i915_request *rq)
+{
+	rq->watchdog.timer.function = NULL;
+}
+
+static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
+{
+	struct i915_request *rq =
+		container_of(hrtimer, struct i915_request, watchdog.timer);
+	struct intel_gt *gt = rq->engine->gt;
+
+	if (!i915_request_completed(rq)) {
+		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
+			schedule_work(&gt->watchdog.work);
+	} else {
+		i915_request_put(rq);
+	}
+
+	return HRTIMER_NORESTART;
+}
+
+static void __rq_arm_watchdog(struct i915_request *rq)
+{
+	struct i915_request_watchdog *wdg = &rq->watchdog;
+	struct intel_context *ce = rq->context;
+
+	if (!ce->watchdog.timeout_us)
+		return;
+
+	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	wdg->timer.function = __rq_watchdog_expired;
+	hrtimer_start_range_ns(&wdg->timer,
+			       ns_to_ktime(ce->watchdog.timeout_us *
+					   NSEC_PER_USEC),
+			       NSEC_PER_MSEC,
+			       HRTIMER_MODE_REL);
+	i915_request_get(rq);
+}
+
+static void __rq_cancel_watchdog(struct i915_request *rq)
+{
+	struct i915_request_watchdog *wdg = &rq->watchdog;
+
+	if (wdg->timer.function && hrtimer_try_to_cancel(&wdg->timer) > 0)
+		i915_request_put(rq);
+}
+
 bool i915_request_retire(struct i915_request *rq)
 {
 	if (!__i915_request_is_complete(rq))
@@ -332,6 +379,8 @@ bool i915_request_retire(struct i915_request *rq)
 	trace_i915_request_retire(rq);
 	i915_request_mark_complete(rq);
 
+	__rq_cancel_watchdog(rq);
+
 	/*
 	 * We know the GPU must have read the request to have
 	 * sent us the seqno + interrupt, so use the position
@@ -692,6 +741,8 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 
 		if (unlikely(fence->error))
 			i915_request_set_error_once(request, fence->error);
+		else
+			__rq_arm_watchdog(request);
 
 		/*
 		 * We need to serialize use of the submit_request() callback
@@ -879,6 +930,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 
 	/* No zalloc, everything must be cleared after use */
 	rq->batch = NULL;
+	__rq_init_watchdog(rq);
 	GEM_BUG_ON(rq->capture_list);
 	GEM_BUG_ON(!llist_empty(&rq->execute_cb));
 
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 2dea55ea69e1..5e0946992f1a 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -26,7 +26,9 @@
 #define I915_REQUEST_H
 
 #include <linux/dma-fence.h>
+#include <linux/hrtimer.h>
 #include <linux/irq_work.h>
+#include <linux/llist.h>
 #include <linux/lockdep.h>
 
 #include "gem/i915_gem_context_types.h"
@@ -289,6 +291,12 @@ struct i915_request {
 	/** timeline->request entry for this request */
 	struct list_head link;
 
+	/** Watchdog support fields. */
+	struct i915_request_watchdog {
+		struct llist_node link;
+		struct hrtimer timer;
+	} watchdog;
+
 	I915_SELFTEST_DECLARE(struct {
 		struct list_head link;
 		unsigned long delay;
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 5/7] drm/i915: Request watchdog infrastructure
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, Matthew Auld, dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Prepares the plumbing for setting request/fence expiration time. All code
is put in place but is never activated due yet missing ability to actually
configure the timer.

Outline of the basic operation:

A timer is started when request is ready for execution. If the request
completes (retires) before the timer fires, timer is cancelled and nothing
further happens.

If the timer fires request is added to a lockless list and worker queued.
Purpose of this is twofold: a) It allows request cancellation from a more
friendly context and b) coalesces multiple expirations into a single event
of consuming the list.

Worker locklessly consumes the list of expired requests and cancels them
all using previous added i915_request_cancel().

Associated timeout value is stored in rq->context.watchdog.timeout_us.

v2:
 * Log expiration.

v3:
 * Include more information about user timeline in the log message.

v4:
 * Remove obsolete comment and fix formatting. (Matt)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |  4 ++
 .../drm/i915/gt/intel_execlists_submission.h  |  2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
 drivers/gpu/drm/i915/gt/intel_gt.h            |  2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 28 ++++++++++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |  7 +++
 drivers/gpu/drm/i915/i915_request.c           | 52 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.h           |  8 +++
 8 files changed, 106 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 0ea18c9e2aca..65a5730a4f5b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -99,6 +99,10 @@ struct intel_context {
 #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
 #define CONTEXT_NOPREEMPT		8
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	u32 *lrc_reg_state;
 	union {
 		struct {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
index f7bd3fccfee8..4ca9b475e252 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
@@ -6,6 +6,7 @@
 #ifndef __INTEL_EXECLISTS_SUBMISSION_H__
 #define __INTEL_EXECLISTS_SUBMISSION_H__
 
+#include <linux/llist.h>
 #include <linux/types.h>
 
 struct drm_printer;
@@ -13,6 +14,7 @@ struct drm_printer;
 struct i915_request;
 struct intel_context;
 struct intel_engine_cs;
+struct intel_gt;
 
 enum {
 	INTEL_CONTEXT_SCHEDULE_IN = 0,
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index ca76f93bc03d..8d77dcbad059 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -31,6 +31,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
 	INIT_LIST_HEAD(&gt->closed_vma);
 	spin_lock_init(&gt->closed_lock);
 
+	init_llist_head(&gt->watchdog.list);
+	INIT_WORK(&gt->watchdog.work, intel_gt_watchdog_work);
+
 	intel_gt_init_buffer_pool(gt);
 	intel_gt_init_reset(gt);
 	intel_gt_init_requests(gt);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h
index a17bd8b3195f..7ec395cace69 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -78,4 +78,6 @@ static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
 void intel_gt_info_print(const struct intel_gt_info *info,
 			 struct drm_printer *p);
 
+void intel_gt_watchdog_work(struct work_struct *work);
+
 #endif /* __INTEL_GT_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 36ec97f79174..fbfd19b2e5f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -8,6 +8,7 @@
 #include "i915_drv.h" /* for_each_engine() */
 #include "i915_request.h"
 #include "intel_engine_heartbeat.h"
+#include "intel_execlists_submission.h"
 #include "intel_gt.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_requests.h"
@@ -242,4 +243,31 @@ void intel_gt_fini_requests(struct intel_gt *gt)
 {
 	/* Wait until the work is marked as finished before unloading! */
 	cancel_delayed_work_sync(&gt->requests.retire_work);
+
+	flush_work(&gt->watchdog.work);
+}
+
+void intel_gt_watchdog_work(struct work_struct *work)
+{
+	struct intel_gt *gt =
+		container_of(work, typeof(*gt), watchdog.work);
+	struct i915_request *rq, *rn;
+	struct llist_node *first;
+
+	first = llist_del_all(&gt->watchdog.list);
+	if (!first)
+		return;
+
+	llist_for_each_entry_safe(rq, rn, first, watchdog.link) {
+		if (!i915_request_completed(rq)) {
+			struct dma_fence *f = &rq->fence;
+
+			pr_notice("Fence expiration time out i915-%s:%s:%llx!\n",
+				  f->ops->get_driver_name(f),
+				  f->ops->get_timeline_name(f),
+				  f->seqno);
+			i915_request_cancel(rq, -EINTR);
+		}
+		i915_request_put(rq);
+	}
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 626af37c7790..d70ebcc6f19f 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -8,10 +8,12 @@
 
 #include <linux/ktime.h>
 #include <linux/list.h>
+#include <linux/llist.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
+#include <linux/workqueue.h>
 
 #include "uc/intel_uc.h"
 
@@ -62,6 +64,11 @@ struct intel_gt {
 		struct delayed_work retire_work;
 	} requests;
 
+	struct {
+		struct llist_head list;
+		struct work_struct work;
+	} watchdog;
+
 	struct intel_wakeref wakeref;
 	atomic_t user_wakeref;
 
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index d1a4a3fa7425..d58052f3410c 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -321,6 +321,53 @@ static void remove_from_engine(struct i915_request *rq)
 	__notify_execute_cb_imm(rq);
 }
 
+static void __rq_init_watchdog(struct i915_request *rq)
+{
+	rq->watchdog.timer.function = NULL;
+}
+
+static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
+{
+	struct i915_request *rq =
+		container_of(hrtimer, struct i915_request, watchdog.timer);
+	struct intel_gt *gt = rq->engine->gt;
+
+	if (!i915_request_completed(rq)) {
+		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
+			schedule_work(&gt->watchdog.work);
+	} else {
+		i915_request_put(rq);
+	}
+
+	return HRTIMER_NORESTART;
+}
+
+static void __rq_arm_watchdog(struct i915_request *rq)
+{
+	struct i915_request_watchdog *wdg = &rq->watchdog;
+	struct intel_context *ce = rq->context;
+
+	if (!ce->watchdog.timeout_us)
+		return;
+
+	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	wdg->timer.function = __rq_watchdog_expired;
+	hrtimer_start_range_ns(&wdg->timer,
+			       ns_to_ktime(ce->watchdog.timeout_us *
+					   NSEC_PER_USEC),
+			       NSEC_PER_MSEC,
+			       HRTIMER_MODE_REL);
+	i915_request_get(rq);
+}
+
+static void __rq_cancel_watchdog(struct i915_request *rq)
+{
+	struct i915_request_watchdog *wdg = &rq->watchdog;
+
+	if (wdg->timer.function && hrtimer_try_to_cancel(&wdg->timer) > 0)
+		i915_request_put(rq);
+}
+
 bool i915_request_retire(struct i915_request *rq)
 {
 	if (!__i915_request_is_complete(rq))
@@ -332,6 +379,8 @@ bool i915_request_retire(struct i915_request *rq)
 	trace_i915_request_retire(rq);
 	i915_request_mark_complete(rq);
 
+	__rq_cancel_watchdog(rq);
+
 	/*
 	 * We know the GPU must have read the request to have
 	 * sent us the seqno + interrupt, so use the position
@@ -692,6 +741,8 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 
 		if (unlikely(fence->error))
 			i915_request_set_error_once(request, fence->error);
+		else
+			__rq_arm_watchdog(request);
 
 		/*
 		 * We need to serialize use of the submit_request() callback
@@ -879,6 +930,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 
 	/* No zalloc, everything must be cleared after use */
 	rq->batch = NULL;
+	__rq_init_watchdog(rq);
 	GEM_BUG_ON(rq->capture_list);
 	GEM_BUG_ON(!llist_empty(&rq->execute_cb));
 
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 2dea55ea69e1..5e0946992f1a 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -26,7 +26,9 @@
 #define I915_REQUEST_H
 
 #include <linux/dma-fence.h>
+#include <linux/hrtimer.h>
 #include <linux/irq_work.h>
+#include <linux/llist.h>
 #include <linux/lockdep.h>
 
 #include "gem/i915_gem_context_types.h"
@@ -289,6 +291,12 @@ struct i915_request {
 	/** timeline->request entry for this request */
 	struct list_head link;
 
+	/** Watchdog support fields. */
+	struct i915_request_watchdog {
+		struct llist_node link;
+		struct hrtimer timer;
+	} watchdog;
+
 	I915_SELFTEST_DECLARE(struct {
 		struct list_head link;
 		unsigned long delay;
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 6/7] drm/i915: Fail too long user submissions by default
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, Matthew Auld, dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

A new Kconfig option CONFIG_DRM_I915_REQUEST_TIMEOUT is added, defaulting
to 20s, and this timeout is applied to all users contexts using the
previously added watchdog facility.

Result of this is that any user submission will simply fail after this
timeout, either causing a reset (for non-preemptable), or incomplete
results.

This can have an effect that workloads which used to work fine will
suddenly start failing. Even workloads comprised of short batches but in
long dependency chains can be terminated.

And because of lack of agreement on usefulness and safety of fence error
propagation this partial execution can be invisible to userspace even if
it is "listening" to returned fence status.

Another interaction is with hangcheck where care needs to be taken timeout
is not set lower or close to three times the heartbeat interval. Otherwise
a hang in any application can cause complete termination of all
submissions from unrelated clients. Any users modifying the per engine
heartbeat intervals therefore need to be aware of this potential denial of
service to avoid inadvertently enabling it.

Given all this I am personally not convinced the scheme is a good idea.
Intuitively it feels object importers would be better positioned to
enforce the time they are willing to wait for something to complete.

v2:
 * Improved commit message and Kconfig text.
 * Pull in some helper code from patch which got dropped.

v3:
 * Bump timeout to 20s to see if it helps Tigerlake.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile          | 14 +++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 38 +++++++++++++++++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  4 ++
 drivers/gpu/drm/i915/gt/intel_context_param.h | 11 +++++-
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 35bbe2b80596..39328567c200 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,3 +1,17 @@
+config DRM_I915_REQUEST_TIMEOUT
+	int "Default timeout for requests (ms)"
+	default 20000 # milliseconds
+	help
+	  Configures the default timeout after which any user submissions will
+	  be forcefully terminated.
+
+	  Beware setting this value lower, or close to heartbeat interval
+	  rounded to whole seconds times three, in order to avoid allowing
+	  misbehaving applications causing total rendering failure in unrelated
+	  clients.
+
+	  May be 0 to disable the timeout.
+
 config DRM_I915_FENCE_TIMEOUT
 	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
 	default 10000 # milliseconds
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 03a2f5f2a11f..33ff1a6a7724 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -233,6 +233,8 @@ static void intel_context_set_gem(struct intel_context *ce,
 	if (ctx->sched.priority >= I915_PRIORITY_NORMAL &&
 	    intel_engine_has_timeslices(ce->engine))
 		__set_bit(CONTEXT_USE_SEMAPHORES, &ce->flags);
+
+	intel_context_set_watchdog_us(ce, ctx->watchdog.timeout_us);
 }
 
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
@@ -820,6 +822,40 @@ static void __assign_timeline(struct i915_gem_context *ctx,
 	context_apply_all(ctx, __apply_timeline, timeline);
 }
 
+static int __apply_watchdog(struct intel_context *ce, void *timeout_us)
+{
+	return intel_context_set_watchdog_us(ce, (uintptr_t)timeout_us);
+}
+
+static int
+__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us)
+{
+	int ret;
+
+	ret = context_apply_all(ctx, __apply_watchdog,
+				(void *)(uintptr_t)timeout_us);
+	if (!ret)
+		ctx->watchdog.timeout_us = timeout_us;
+
+	return ret;
+}
+
+static void __set_default_fence_expiry(struct i915_gem_context *ctx)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	int ret;
+
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+		return;
+
+	/* Default expiry for user fences. */
+	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	if (ret)
+		drm_notice(&i915->drm,
+			   "Failed to configure default fence expiry! (%d)",
+			   ret);
+}
+
 static struct i915_gem_context *
 i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 {
@@ -864,6 +900,8 @@ i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 		intel_timeline_put(timeline);
 	}
 
+	__set_default_fence_expiry(ctx);
+
 	trace_i915_context_create(ctx);
 
 	return ctx;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index d5bc75508048..f17da7e26c43 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 	 */
 	atomic_t active_count;
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	/**
 	 * @hang_timestamp: The last time(s) this context caused a GPU hang
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_context_param.h b/drivers/gpu/drm/i915/gt/intel_context_param.h
index f053d8633fe2..3ecacc675f41 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_param.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_param.h
@@ -6,9 +6,18 @@
 #ifndef INTEL_CONTEXT_PARAM_H
 #define INTEL_CONTEXT_PARAM_H
 
-struct intel_context;
+#include <linux/types.h>
+
+#include "intel_context.h"
 
 int intel_context_set_ring_size(struct intel_context *ce, long sz);
 long intel_context_get_ring_size(struct intel_context *ce);
 
+static inline int
+intel_context_set_watchdog_us(struct intel_context *ce, u64 timeout_us)
+{
+	ce->watchdog.timeout_us = timeout_us;
+	return 0;
+}
+
 #endif /* INTEL_CONTEXT_PARAM_H */
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 6/7] drm/i915: Fail too long user submissions by default
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, Matthew Auld, dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

A new Kconfig option CONFIG_DRM_I915_REQUEST_TIMEOUT is added, defaulting
to 20s, and this timeout is applied to all users contexts using the
previously added watchdog facility.

Result of this is that any user submission will simply fail after this
timeout, either causing a reset (for non-preemptable), or incomplete
results.

This can have an effect that workloads which used to work fine will
suddenly start failing. Even workloads comprised of short batches but in
long dependency chains can be terminated.

And because of lack of agreement on usefulness and safety of fence error
propagation this partial execution can be invisible to userspace even if
it is "listening" to returned fence status.

Another interaction is with hangcheck where care needs to be taken timeout
is not set lower or close to three times the heartbeat interval. Otherwise
a hang in any application can cause complete termination of all
submissions from unrelated clients. Any users modifying the per engine
heartbeat intervals therefore need to be aware of this potential denial of
service to avoid inadvertently enabling it.

Given all this I am personally not convinced the scheme is a good idea.
Intuitively it feels object importers would be better positioned to
enforce the time they are willing to wait for something to complete.

v2:
 * Improved commit message and Kconfig text.
 * Pull in some helper code from patch which got dropped.

v3:
 * Bump timeout to 20s to see if it helps Tigerlake.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile          | 14 +++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 38 +++++++++++++++++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  4 ++
 drivers/gpu/drm/i915/gt/intel_context_param.h | 11 +++++-
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 35bbe2b80596..39328567c200 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -1,3 +1,17 @@
+config DRM_I915_REQUEST_TIMEOUT
+	int "Default timeout for requests (ms)"
+	default 20000 # milliseconds
+	help
+	  Configures the default timeout after which any user submissions will
+	  be forcefully terminated.
+
+	  Beware setting this value lower, or close to heartbeat interval
+	  rounded to whole seconds times three, in order to avoid allowing
+	  misbehaving applications causing total rendering failure in unrelated
+	  clients.
+
+	  May be 0 to disable the timeout.
+
 config DRM_I915_FENCE_TIMEOUT
 	int "Timeout for unsignaled foreign fences (ms, jiffy granularity)"
 	default 10000 # milliseconds
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 03a2f5f2a11f..33ff1a6a7724 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -233,6 +233,8 @@ static void intel_context_set_gem(struct intel_context *ce,
 	if (ctx->sched.priority >= I915_PRIORITY_NORMAL &&
 	    intel_engine_has_timeslices(ce->engine))
 		__set_bit(CONTEXT_USE_SEMAPHORES, &ce->flags);
+
+	intel_context_set_watchdog_us(ce, ctx->watchdog.timeout_us);
 }
 
 static void __free_engines(struct i915_gem_engines *e, unsigned int count)
@@ -820,6 +822,40 @@ static void __assign_timeline(struct i915_gem_context *ctx,
 	context_apply_all(ctx, __apply_timeline, timeline);
 }
 
+static int __apply_watchdog(struct intel_context *ce, void *timeout_us)
+{
+	return intel_context_set_watchdog_us(ce, (uintptr_t)timeout_us);
+}
+
+static int
+__set_watchdog(struct i915_gem_context *ctx, unsigned long timeout_us)
+{
+	int ret;
+
+	ret = context_apply_all(ctx, __apply_watchdog,
+				(void *)(uintptr_t)timeout_us);
+	if (!ret)
+		ctx->watchdog.timeout_us = timeout_us;
+
+	return ret;
+}
+
+static void __set_default_fence_expiry(struct i915_gem_context *ctx)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	int ret;
+
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+		return;
+
+	/* Default expiry for user fences. */
+	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	if (ret)
+		drm_notice(&i915->drm,
+			   "Failed to configure default fence expiry! (%d)",
+			   ret);
+}
+
 static struct i915_gem_context *
 i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 {
@@ -864,6 +900,8 @@ i915_gem_create_context(struct drm_i915_private *i915, unsigned int flags)
 		intel_timeline_put(timeline);
 	}
 
+	__set_default_fence_expiry(ctx);
+
 	trace_i915_context_create(ctx);
 
 	return ctx;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index d5bc75508048..f17da7e26c43 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 	 */
 	atomic_t active_count;
 
+	struct {
+		u64 timeout_us;
+	} watchdog;
+
 	/**
 	 * @hang_timestamp: The last time(s) this context caused a GPU hang
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_context_param.h b/drivers/gpu/drm/i915/gt/intel_context_param.h
index f053d8633fe2..3ecacc675f41 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_param.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_param.h
@@ -6,9 +6,18 @@
 #ifndef INTEL_CONTEXT_PARAM_H
 #define INTEL_CONTEXT_PARAM_H
 
-struct intel_context;
+#include <linux/types.h>
+
+#include "intel_context.h"
 
 int intel_context_set_ring_size(struct intel_context *ce, long sz);
 long intel_context_get_ring_size(struct intel_context *ce);
 
+static inline int
+intel_context_set_watchdog_us(struct intel_context *ce, u64 timeout_us)
+{
+	ce->watchdog.timeout_us = timeout_us;
+	return 0;
+}
+
 #endif /* INTEL_CONTEXT_PARAM_H */
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, dri-devel, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Module parameter is added (request_timeout_ms) to allow configuring the
default request/fence expiry.

Default value is inherited from CONFIG_DRM_I915_REQUEST_TIMEOUT.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++--
 drivers/gpu/drm/i915/i915_params.c          | 5 +++++
 drivers/gpu/drm/i915/i915_params.h          | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 33ff1a6a7724..0e8f0476e01f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -845,11 +845,12 @@ static void __set_default_fence_expiry(struct i915_gem_context *ctx)
 	struct drm_i915_private *i915 = ctx->i915;
 	int ret;
 
-	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT) ||
+	    !i915->params.request_timeout_ms)
 		return;
 
 	/* Default expiry for user fences. */
-	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	ret = __set_watchdog(ctx, i915->params.request_timeout_ms * 1000);
 	if (ret)
 		drm_notice(&i915->drm,
 			   "Failed to configure default fence expiry! (%d)",
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 6939634e56ed..0320878d96b0 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -197,6 +197,11 @@ i915_param_named_unsafe(fake_lmem_start, ulong, 0400,
 	"Fake LMEM start offset (default: 0)");
 #endif
 
+#if CONFIG_DRM_I915_REQUEST_TIMEOUT
+i915_param_named_unsafe(request_timeout_ms, uint, 0600,
+			"Default request/fence/batch buffer expiration timeout.");
+#endif
+
 static __always_inline void _print_param(struct drm_printer *p,
 					 const char *name,
 					 const char *type,
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 48f47e44e848..34ebb0662547 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -72,6 +72,7 @@ struct drm_printer;
 	param(int, enable_dpcd_backlight, -1, 0600) \
 	param(char *, force_probe, CONFIG_DRM_I915_FORCE_PROBE, 0400) \
 	param(unsigned long, fake_lmem_start, 0, 0400) \
+	param(unsigned int, request_timeout_ms, CONFIG_DRM_I915_REQUEST_TIMEOUT, 0600) \
 	/* leave bools at the end to not create holes */ \
 	param(bool, enable_hangcheck, true, 0600) \
 	param(bool, load_detect_test, false, 0600) \
-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Intel-gfx] [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam
@ 2021-03-24 12:13   ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-24 12:13 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Daniel Vetter, dri-devel

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Module parameter is added (request_timeout_ms) to allow configuring the
default request/fence expiry.

Default value is inherited from CONFIG_DRM_I915_REQUEST_TIMEOUT.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++--
 drivers/gpu/drm/i915/i915_params.c          | 5 +++++
 drivers/gpu/drm/i915/i915_params.h          | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 33ff1a6a7724..0e8f0476e01f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -845,11 +845,12 @@ static void __set_default_fence_expiry(struct i915_gem_context *ctx)
 	struct drm_i915_private *i915 = ctx->i915;
 	int ret;
 
-	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
+	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT) ||
+	    !i915->params.request_timeout_ms)
 		return;
 
 	/* Default expiry for user fences. */
-	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
+	ret = __set_watchdog(ctx, i915->params.request_timeout_ms * 1000);
 	if (ret)
 		drm_notice(&i915->drm,
 			   "Failed to configure default fence expiry! (%d)",
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 6939634e56ed..0320878d96b0 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -197,6 +197,11 @@ i915_param_named_unsafe(fake_lmem_start, ulong, 0400,
 	"Fake LMEM start offset (default: 0)");
 #endif
 
+#if CONFIG_DRM_I915_REQUEST_TIMEOUT
+i915_param_named_unsafe(request_timeout_ms, uint, 0600,
+			"Default request/fence/batch buffer expiration timeout.");
+#endif
+
 static __always_inline void _print_param(struct drm_printer *p,
 					 const char *name,
 					 const char *type,
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 48f47e44e848..34ebb0662547 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -72,6 +72,7 @@ struct drm_printer;
 	param(int, enable_dpcd_backlight, -1, 0600) \
 	param(char *, force_probe, CONFIG_DRM_I915_FORCE_PROBE, 0400) \
 	param(unsigned long, fake_lmem_start, 0, 0400) \
+	param(unsigned int, request_timeout_ms, CONFIG_DRM_I915_REQUEST_TIMEOUT, 0600) \
 	/* leave bools at the end to not create holes */ \
 	param(bool, enable_hangcheck, true, 0600) \
 	param(bool, load_detect_test, false, 0600) \
-- 
2.27.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 1/7] drm/i915: Extract active lookup engine to a helper
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 12:21     ` Matthew Auld
  -1 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 12:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel

On Wed, 24 Mar 2021 at 12:13, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Move active engine lookup to exported i915_request_active_engine.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 1/7] drm/i915: Extract active lookup engine to a helper
@ 2021-03-24 12:21     ` Matthew Auld
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 12:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel

On Wed, 24 Mar 2021 at 12:13, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Move active engine lookup to exported i915_request_active_engine.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Default request/fence expiry + watchdog (rev5)
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
                   ` (7 preceding siblings ...)
  (?)
@ 2021-03-24 13:16 ` Patchwork
  -1 siblings, 0 replies; 42+ messages in thread
From: Patchwork @ 2021-03-24 13:16 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

== Series Details ==

Series: Default request/fence expiry + watchdog (rev5)
URL   : https://patchwork.freedesktop.org/series/87930/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
8fc435bf4d9f drm/i915: Extract active lookup engine to a helper
-:114: CHECK:LINE_SPACING: Please don't use multiple blank lines
#114: FILE: drivers/gpu/drm/i915/i915_request.c:290:
+
+

total: 0 errors, 0 warnings, 1 checks, 104 lines checked
eee19b9617fc drm/i915: Individual request cancellation
263ce11450a2 drm/i915: Restrict sentinel requests further
9d6ef50b29a0 drm/i915: Handle async cancellation in sentinel assert
d0d85357b1fb drm/i915: Request watchdog infrastructure
22d460cbf87f drm/i915: Fail too long user submissions by default
d81b27e279a2 drm/i915: Allow configuring default request expiry via modparam


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Default request/fence expiry + watchdog (rev5)
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
                   ` (8 preceding siblings ...)
  (?)
@ 2021-03-24 13:21 ` Patchwork
  -1 siblings, 0 replies; 42+ messages in thread
From: Patchwork @ 2021-03-24 13:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

== Series Details ==

Series: Default request/fence expiry + watchdog (rev5)
URL   : https://patchwork.freedesktop.org/series/87930/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
/home/cidrm/kernel/Documentation/gpu/i915:22: ./drivers/gpu/drm/i915/intel_runtime_pm.c:423: WARNING: Inline strong start-string without end-string.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Default request/fence expiry + watchdog (rev5)
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
                   ` (9 preceding siblings ...)
  (?)
@ 2021-03-24 13:48 ` Patchwork
  -1 siblings, 0 replies; 42+ messages in thread
From: Patchwork @ 2021-03-24 13:48 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 6533 bytes --]

== Series Details ==

Series: Default request/fence expiry + watchdog (rev5)
URL   : https://patchwork.freedesktop.org/series/87930/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_9889 -> Patchwork_19847
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/index.html

Known issues
------------

  Here are the changes found in Patchwork_19847 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@query-info:
    - fi-bsw-kefka:       NOTRUN -> [SKIP][1] ([fdo#109271]) +17 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-bsw-kefka/igt@amdgpu/amd_basic@query-info.html

  * igt@amdgpu/amd_basic@semaphore:
    - fi-bdw-5557u:       NOTRUN -> [SKIP][2] ([fdo#109271]) +26 similar issues
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-bdw-5557u/igt@amdgpu/amd_basic@semaphore.html

  * igt@core_hotunplug@unbind-rebind:
    - fi-bdw-5557u:       NOTRUN -> [WARN][3] ([i915#2283])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-bdw-5557u/igt@core_hotunplug@unbind-rebind.html

  * igt@fbdev@read:
    - fi-tgl-y:           [PASS][4] -> [DMESG-WARN][5] ([i915#402]) +1 similar issue
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/fi-tgl-y/igt@fbdev@read.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-tgl-y/igt@fbdev@read.html

  * igt@gem_exec_suspend@basic-s0:
    - fi-tgl-u2:          [PASS][6] -> [FAIL][7] ([i915#1888])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/fi-tgl-u2/igt@gem_exec_suspend@basic-s0.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-tgl-u2/igt@gem_exec_suspend@basic-s0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-byt-j1900:       NOTRUN -> [SKIP][8] ([fdo#109271]) +27 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-byt-j1900/igt@gem_huc_copy@huc-copy.html
    - fi-skl-guc:         NOTRUN -> [SKIP][9] ([fdo#109271] / [i915#2190])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-skl-guc/igt@gem_huc_copy@huc-copy.html

  * igt@gem_tiled_blits@basic:
    - fi-kbl-8809g:       [PASS][10] -> [TIMEOUT][11] ([i915#2502] / [i915#3145])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/fi-kbl-8809g/igt@gem_tiled_blits@basic.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-kbl-8809g/igt@gem_tiled_blits@basic.html

  * igt@gem_tiled_fence_blits@basic:
    - fi-kbl-8809g:       [PASS][12] -> [TIMEOUT][13] ([i915#3145])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/fi-kbl-8809g/igt@gem_tiled_fence_blits@basic.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-kbl-8809g/igt@gem_tiled_fence_blits@basic.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-skl-guc:         NOTRUN -> [SKIP][14] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-skl-guc/igt@kms_chamelium@dp-crc-fast.html
    - fi-bdw-5557u:       NOTRUN -> [SKIP][15] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-bdw-5557u/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_chamelium@hdmi-crc-fast:
    - fi-byt-j1900:       NOTRUN -> [SKIP][16] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-byt-j1900/igt@kms_chamelium@hdmi-crc-fast.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-skl-guc:         NOTRUN -> [SKIP][17] ([fdo#109271] / [i915#533])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-skl-guc/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-skl-guc:         NOTRUN -> [SKIP][18] ([fdo#109271]) +25 similar issues
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-skl-guc/igt@kms_psr@primary_mmap_gtt.html

  
#### Possible fixes ####

  * igt@gem_flink_basic@bad-open:
    - fi-tgl-y:           [DMESG-WARN][19] ([i915#402]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/fi-tgl-y/igt@gem_flink_basic@bad-open.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/fi-tgl-y/igt@gem_flink_basic@bad-open.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2283]: https://gitlab.freedesktop.org/drm/intel/issues/2283
  [i915#2502]: https://gitlab.freedesktop.org/drm/intel/issues/2502
  [i915#3145]: https://gitlab.freedesktop.org/drm/intel/issues/3145
  [i915#3277]: https://gitlab.freedesktop.org/drm/intel/issues/3277
  [i915#3283]: https://gitlab.freedesktop.org/drm/intel/issues/3283
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533


Participating hosts (46 -> 43)
------------------------------

  Additional (2): fi-byt-j1900 fi-skl-guc 
  Missing    (5): fi-ilk-m540 fi-hsw-4200u fi-bsw-cyan fi-ctg-p8600 fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_9889 -> Patchwork_19847

  CI-20190529: 20190529
  CI_DRM_9889: c42d2e7296ecebf00ae234a847059cc92e41a86c @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6044: 2c2fc6470646eb5e25fc6ea02449ef744f8b70c2 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_19847: d81b27e279a26f5d9d43464f3edae299624cfae6 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

d81b27e279a2 drm/i915: Allow configuring default request expiry via modparam
22d460cbf87f drm/i915: Fail too long user submissions by default
d0d85357b1fb drm/i915: Request watchdog infrastructure
9d6ef50b29a0 drm/i915: Handle async cancellation in sentinel assert
263ce11450a2 drm/i915: Restrict sentinel requests further
eee19b9617fc drm/i915: Individual request cancellation
8fc435bf4d9f drm/i915: Extract active lookup engine to a helper

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/index.html

[-- Attachment #1.2: Type: text/html, Size: 8196 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 2/7] drm/i915: Individual request cancellation
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 15:24     ` Matthew Auld
  -1 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 15:24 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel Graphics Development, Matthew Auld, ML dri-devel, Chris Wilson

On Wed, 24 Mar 2021 at 12:13, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
>
> Currently, we cancel outstanding requests within a context when the
> context is closed. We may also want to cancel individual requests using
> the same graceful preemption mechanism.
>
> v2 (Tvrtko):
>  * Cancel waiters carefully considering no timeline lock and RCU.
>  * Fixed selftests.
>
> v3 (Tvrtko):
>  * Remove error propagation to waiters for now.
>
> v4 (Tvrtko):
>  * Rebase for extracted i915_request_active_engine. (Matt)
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com> # v3
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 2/7] drm/i915: Individual request cancellation
@ 2021-03-24 15:24     ` Matthew Auld
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 15:24 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Intel Graphics Development, Matthew Auld, ML dri-devel, Chris Wilson

On Wed, 24 Mar 2021 at 12:13, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
>
> Currently, we cancel outstanding requests within a context when the
> context is closed. We may also want to cancel individual requests using
> the same graceful preemption mechanism.
>
> v2 (Tvrtko):
>  * Cancel waiters carefully considering no timeline lock and RCU.
>  * Fixed selftests.
>
> v3 (Tvrtko):
>  * Remove error propagation to waiters for now.
>
> v4 (Tvrtko):
>  * Rebase for extracted i915_request_active_engine. (Matt)
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com> # v3
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/7] drm/i915: Restrict sentinel requests further
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 15:25     ` Matthew Auld
  -1 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 15:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel, Tvrtko Ursulin

On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Disallow sentinel requests follow previous sentinels to make request
> cancellation work better when faced with a chain of requests which have
> all been marked as in error.
>
> Because in cases where we end up with a stream of cancelled requests we
> want to turn of request coalescing so they each will get individually

turn off

> skipped by the execlists_schedule_in (which is called per ELSP port, not
> per request).
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 3/7] drm/i915: Restrict sentinel requests further
@ 2021-03-24 15:25     ` Matthew Auld
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 15:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel

On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Disallow sentinel requests follow previous sentinels to make request
> cancellation work better when faced with a chain of requests which have
> all been marked as in error.
>
> Because in cases where we end up with a stream of cancelled requests we
> want to turn of request coalescing so they each will get individually

turn off

> skipped by the execlists_schedule_in (which is called per ELSP port, not
> per request).
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-24 17:22     ` Matthew Auld
  -1 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 17:22 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel

On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> With the watchdog cancelling requests asynchronously to preempt-to-busy we
> need to relax one assert making it apply only to requests not in error.
>
> v2:
>  * Check against the correct request!
>
> v3:
>  * Simplify the check to avoid the question of when to sample the fence
>    error vs sentinel bit.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert
@ 2021-03-24 17:22     ` Matthew Auld
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Auld @ 2021-03-24 17:22 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel Graphics Development, ML dri-devel

On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> With the watchdog cancelling requests asynchronously to preempt-to-busy we
> need to relax one assert making it apply only to requests not in error.
>
> v2:
>  * Check against the correct request!
>
> v3:
>  * Simplify the check to avoid the question of when to sample the fence
>    error vs sentinel bit.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for Default request/fence expiry + watchdog (rev5)
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
                   ` (10 preceding siblings ...)
  (?)
@ 2021-03-24 23:29 ` Patchwork
  -1 siblings, 0 replies; 42+ messages in thread
From: Patchwork @ 2021-03-24 23:29 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 30269 bytes --]

== Series Details ==

Series: Default request/fence expiry + watchdog (rev5)
URL   : https://patchwork.freedesktop.org/series/87930/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_9889_full -> Patchwork_19847_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_19847_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_19847_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_19847_full:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_ctx_ringsize@active@bcs0:
    - shard-skl:          NOTRUN -> [INCOMPLETE][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl2/igt@gem_ctx_ringsize@active@bcs0.html

  * igt@gem_ctx_ringsize@idle@bcs0:
    - shard-skl:          [PASS][2] -> [INCOMPLETE][3]
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl3/igt@gem_ctx_ringsize@idle@bcs0.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl4/igt@gem_ctx_ringsize@idle@bcs0.html

  * igt@kms_flip_tiling@flip-to-y-tiled@edp-1-pipe-a:
    - shard-skl:          [PASS][4] -> [FAIL][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl9/igt@kms_flip_tiling@flip-to-y-tiled@edp-1-pipe-a.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl5/igt@kms_flip_tiling@flip-to-y-tiled@edp-1-pipe-a.html

  
#### Warnings ####

  * igt@gem_userptr_blits@vma-merge:
    - shard-iclb:         [INCOMPLETE][6] ([i915#2502] / [i915#2667]) -> [FAIL][7]
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-iclb3/igt@gem_userptr_blits@vma-merge.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@gem_userptr_blits@vma-merge.html
    - shard-glk:          [INCOMPLETE][8] ([i915#2502] / [i915#2667]) -> [FAIL][9]
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk7/igt@gem_userptr_blits@vma-merge.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@gem_userptr_blits@vma-merge.html
    - shard-kbl:          [INCOMPLETE][10] ([i915#2502] / [i915#2667]) -> [FAIL][11]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-kbl7/igt@gem_userptr_blits@vma-merge.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@gem_userptr_blits@vma-merge.html
    - shard-tglb:         [INCOMPLETE][12] ([i915#2502] / [i915#2667]) -> [FAIL][13]
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-tglb3/igt@gem_userptr_blits@vma-merge.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@gem_userptr_blits@vma-merge.html
    - shard-skl:          [INCOMPLETE][14] ([i915#2502] / [i915#2667]) -> [FAIL][15]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl2/igt@gem_userptr_blits@vma-merge.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@gem_userptr_blits@vma-merge.html

  
Known issues
------------

  Here are the changes found in Patchwork_19847_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@feature_discovery@display-3x:
    - shard-glk:          NOTRUN -> [SKIP][16] ([fdo#109271]) +17 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@feature_discovery@display-3x.html
    - shard-iclb:         NOTRUN -> [SKIP][17] ([i915#1839])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@feature_discovery@display-3x.html
    - shard-tglb:         NOTRUN -> [SKIP][18] ([i915#1839])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@feature_discovery@display-3x.html

  * igt@feature_discovery@psr2:
    - shard-iclb:         NOTRUN -> [SKIP][19] ([i915#658]) +1 similar issue
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb8/igt@feature_discovery@psr2.html

  * igt@gem_ctx_param@set-priority-not-supported:
    - shard-tglb:         NOTRUN -> [SKIP][20] ([fdo#109314])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@gem_ctx_param@set-priority-not-supported.html
    - shard-iclb:         NOTRUN -> [SKIP][21] ([fdo#109314])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@gem_ctx_param@set-priority-not-supported.html

  * igt@gem_ctx_persistence@process:
    - shard-snb:          NOTRUN -> [SKIP][22] ([fdo#109271] / [i915#1099]) +1 similar issue
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-snb5/igt@gem_ctx_persistence@process.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-kbl:          [PASS][23] -> [FAIL][24] ([i915#2842]) +2 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-kbl4/igt@gem_exec_fair@basic-pace@rcs0.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl4/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-glk:          [PASS][25] -> [FAIL][26] ([i915#2842]) +3 similar issues
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk4/igt@gem_exec_fair@basic-pace@vecs0.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk2/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_reloc@basic-wide-active@bcs0:
    - shard-skl:          NOTRUN -> [FAIL][27] ([i915#2389]) +3 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl8/igt@gem_exec_reloc@basic-wide-active@bcs0.html

  * igt@gem_exec_whisper@basic-contexts-forked-all:
    - shard-glk:          NOTRUN -> [DMESG-WARN][28] ([i915#118] / [i915#95])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@gem_exec_whisper@basic-contexts-forked-all.html

  * igt@gem_mmap_gtt@cpuset-big-copy:
    - shard-glk:          [PASS][29] -> [FAIL][30] ([i915#307])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk3/igt@gem_mmap_gtt@cpuset-big-copy.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk9/igt@gem_mmap_gtt@cpuset-big-copy.html

  * igt@gen3_render_tiledx_blits:
    - shard-iclb:         NOTRUN -> [SKIP][31] ([fdo#109289])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb8/igt@gen3_render_tiledx_blits.html

  * igt@gen9_exec_parse@bb-start-out:
    - shard-tglb:         NOTRUN -> [SKIP][32] ([fdo#112306])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@gen9_exec_parse@bb-start-out.html
    - shard-iclb:         NOTRUN -> [SKIP][33] ([fdo#112306])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@gen9_exec_parse@bb-start-out.html

  * igt@i915_suspend@sysfs-reader:
    - shard-skl:          NOTRUN -> [INCOMPLETE][34] ([i915#198])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl8/igt@i915_suspend@sysfs-reader.html

  * igt@kms_big_fb@yf-tiled-16bpp-rotate-0:
    - shard-tglb:         NOTRUN -> [SKIP][35] ([fdo#111615]) +1 similar issue
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_big_fb@yf-tiled-16bpp-rotate-0.html

  * igt@kms_big_joiner@basic:
    - shard-skl:          NOTRUN -> [SKIP][36] ([fdo#109271] / [i915#2705])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl4/igt@kms_big_joiner@basic.html
    - shard-apl:          NOTRUN -> [SKIP][37] ([fdo#109271] / [i915#2705])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl8/igt@kms_big_joiner@basic.html

  * igt@kms_ccs@pipe-c-random-ccs-data:
    - shard-skl:          NOTRUN -> [SKIP][38] ([fdo#109271] / [fdo#111304])
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl4/igt@kms_ccs@pipe-c-random-ccs-data.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - shard-glk:          NOTRUN -> [SKIP][39] ([fdo#109271] / [fdo#111827]) +5 similar issues
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-crc-fast:
    - shard-snb:          NOTRUN -> [SKIP][40] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-snb5/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_chamelium@dp-hpd-storm:
    - shard-iclb:         NOTRUN -> [SKIP][41] ([fdo#109284] / [fdo#111827]) +6 similar issues
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_chamelium@dp-hpd-storm.html

  * igt@kms_chamelium@hdmi-edid-read:
    - shard-tglb:         NOTRUN -> [SKIP][42] ([fdo#109284] / [fdo#111827]) +5 similar issues
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_chamelium@hdmi-edid-read.html

  * igt@kms_chamelium@hdmi-hpd-for-each-pipe:
    - shard-kbl:          NOTRUN -> [SKIP][43] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_chamelium@hdmi-hpd-for-each-pipe.html

  * igt@kms_chamelium@vga-edid-read:
    - shard-apl:          NOTRUN -> [SKIP][44] ([fdo#109271] / [fdo#111827]) +15 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_chamelium@vga-edid-read.html

  * igt@kms_color@pipe-c-degamma:
    - shard-iclb:         NOTRUN -> [FAIL][45] ([i915#1149])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_color@pipe-c-degamma.html
    - shard-tglb:         NOTRUN -> [FAIL][46] ([i915#1149])
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_color@pipe-c-degamma.html

  * igt@kms_color@pipe-d-ctm-0-5:
    - shard-iclb:         NOTRUN -> [SKIP][47] ([fdo#109278] / [i915#1149])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_color@pipe-d-ctm-0-5.html

  * igt@kms_color_chamelium@pipe-d-ctm-0-75:
    - shard-skl:          NOTRUN -> [SKIP][48] ([fdo#109271] / [fdo#111827]) +15 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl8/igt@kms_color_chamelium@pipe-d-ctm-0-75.html

  * igt@kms_content_protection@atomic:
    - shard-apl:          NOTRUN -> [TIMEOUT][49] ([i915#1319])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_content_protection@atomic.html

  * igt@kms_cursor_crc@pipe-a-cursor-128x128-random:
    - shard-skl:          [PASS][50] -> [FAIL][51] ([i915#54])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl7/igt@kms_cursor_crc@pipe-a-cursor-128x128-random.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl5/igt@kms_cursor_crc@pipe-a-cursor-128x128-random.html

  * igt@kms_cursor_legacy@flip-vs-cursor-legacy:
    - shard-skl:          [PASS][52] -> [FAIL][53] ([i915#2346])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl4/igt@kms_cursor_legacy@flip-vs-cursor-legacy.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl8/igt@kms_cursor_legacy@flip-vs-cursor-legacy.html

  * igt@kms_draw_crc@draw-method-rgb565-mmap-cpu-xtiled:
    - shard-glk:          [PASS][54] -> [FAIL][55] ([i915#52] / [i915#54])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk1/igt@kms_draw_crc@draw-method-rgb565-mmap-cpu-xtiled.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@kms_draw_crc@draw-method-rgb565-mmap-cpu-xtiled.html

  * igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled:
    - shard-skl:          [PASS][56] -> [FAIL][57] ([i915#52] / [i915#54])
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl9/igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled.html
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl5/igt@kms_draw_crc@draw-method-xrgb8888-blt-xtiled.html

  * igt@kms_flip@2x-flip-vs-dpms-off-vs-modeset-interruptible:
    - shard-tglb:         NOTRUN -> [SKIP][58] ([fdo#111825]) +8 similar issues
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_flip@2x-flip-vs-dpms-off-vs-modeset-interruptible.html

  * igt@kms_flip@2x-nonexisting-fb:
    - shard-iclb:         NOTRUN -> [SKIP][59] ([fdo#109274]) +4 similar issues
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_flip@2x-nonexisting-fb.html

  * igt@kms_flip@flip-vs-expired-vblank@b-edp1:
    - shard-tglb:         [PASS][60] -> [FAIL][61] ([i915#79])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-tglb7/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb8/igt@kms_flip@flip-vs-expired-vblank@b-edp1.html

  * igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2:
    - shard-glk:          [PASS][62] -> [FAIL][63] ([i915#79])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk6/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk4/igt@kms_flip@flip-vs-expired-vblank@b-hdmi-a2.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-apl:          [PASS][64] -> [DMESG-WARN][65] ([i915#180]) +1 similar issue
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-apl6/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl8/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_flip@flip-vs-suspend@a-edp1:
    - shard-skl:          [PASS][66] -> [INCOMPLETE][67] ([i915#198])
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl7/igt@kms_flip@flip-vs-suspend@a-edp1.html
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl5/igt@kms_flip@flip-vs-suspend@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs:
    - shard-skl:          NOTRUN -> [FAIL][68] ([i915#2628])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl4/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs.html
    - shard-apl:          NOTRUN -> [FAIL][69] ([i915#2641])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl8/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs.html

  * igt@kms_frontbuffer_tracking@fbc-1p-shrfb-fliptrack-mmap-gtt:
    - shard-skl:          NOTRUN -> [SKIP][70] ([fdo#109271]) +117 similar issues
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@kms_frontbuffer_tracking@fbc-1p-shrfb-fliptrack-mmap-gtt.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-shrfb-msflip-blt:
    - shard-snb:          NOTRUN -> [SKIP][71] ([fdo#109271]) +122 similar issues
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-snb6/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-shrfb-msflip-blt.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-pri-shrfb-draw-render:
    - shard-iclb:         NOTRUN -> [SKIP][72] ([fdo#109280]) +4 similar issues
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-pri-shrfb-draw-render.html

  * igt@kms_frontbuffer_tracking@psr-rgb101010-draw-mmap-wc:
    - shard-kbl:          NOTRUN -> [SKIP][73] ([fdo#109271]) +48 similar issues
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_frontbuffer_tracking@psr-rgb101010-draw-mmap-wc.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [PASS][74] -> [FAIL][75] ([i915#1188])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl6/igt@kms_hdr@bpc-switch-dpms.html
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl10/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d:
    - shard-apl:          NOTRUN -> [SKIP][76] ([fdo#109271] / [i915#533])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
    - shard-apl:          NOTRUN -> [FAIL][77] ([fdo#108145] / [i915#265]) +1 similar issue
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max:
    - shard-skl:          NOTRUN -> [FAIL][78] ([fdo#108145] / [i915#265])
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max.html
    - shard-kbl:          NOTRUN -> [FAIL][79] ([fdo#108145] / [i915#265])
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max.html

  * igt@kms_plane_alpha_blend@pipe-b-coverage-7efc:
    - shard-skl:          [PASS][80] -> [FAIL][81] ([fdo#108145] / [i915#265]) +1 similar issue
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl5/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl6/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html

  * igt@kms_plane_lowres@pipe-a-tiling-none:
    - shard-glk:          NOTRUN -> [FAIL][82] ([i915#899])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@kms_plane_lowres@pipe-a-tiling-none.html

  * igt@kms_plane_lowres@pipe-d-tiling-none:
    - shard-iclb:         NOTRUN -> [SKIP][83] ([fdo#109278]) +1 similar issue
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_plane_lowres@pipe-d-tiling-none.html

  * igt@kms_plane_scaling@scaler-with-clipping-clamping@pipe-c-scaler-with-clipping-clamping:
    - shard-apl:          NOTRUN -> [SKIP][84] ([fdo#109271] / [i915#2733])
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_plane_scaling@scaler-with-clipping-clamping@pipe-c-scaler-with-clipping-clamping.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2:
    - shard-apl:          NOTRUN -> [SKIP][85] ([fdo#109271] / [i915#658]) +6 similar issues
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-4:
    - shard-kbl:          NOTRUN -> [SKIP][86] ([fdo#109271] / [i915#658]) +1 similar issue
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-4.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-5:
    - shard-skl:          NOTRUN -> [SKIP][87] ([fdo#109271] / [i915#658]) +2 similar issues
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl6/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-5.html

  * igt@kms_psr2_su@frontbuffer:
    - shard-glk:          NOTRUN -> [SKIP][88] ([fdo#109271] / [i915#658]) +1 similar issue
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@kms_psr2_su@frontbuffer.html
    - shard-tglb:         NOTRUN -> [FAIL][89] ([i915#2596])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_psr2_su@frontbuffer.html
    - shard-iclb:         NOTRUN -> [SKIP][90] ([fdo#109642] / [fdo#111068] / [i915#658])
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_psr2_su@frontbuffer.html

  * igt@kms_psr@psr2_no_drrs:
    - shard-iclb:         [PASS][91] -> [SKIP][92] ([fdo#109441]) +1 similar issue
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-iclb2/igt@kms_psr@psr2_no_drrs.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb8/igt@kms_psr@psr2_no_drrs.html

  * igt@kms_psr@psr2_sprite_mmap_gtt:
    - shard-iclb:         NOTRUN -> [SKIP][93] ([fdo#109441])
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_psr@psr2_sprite_mmap_gtt.html

  * igt@kms_sysfs_edid_timing:
    - shard-apl:          NOTRUN -> [FAIL][94] ([IGT#2])
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_sysfs_edid_timing.html

  * igt@kms_vblank@pipe-d-ts-continuation-idle:
    - shard-apl:          NOTRUN -> [SKIP][95] ([fdo#109271]) +194 similar issues
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl6/igt@kms_vblank@pipe-d-ts-continuation-idle.html

  * igt@kms_writeback@writeback-invalid-parameters:
    - shard-tglb:         NOTRUN -> [SKIP][96] ([i915#2437])
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@kms_writeback@writeback-invalid-parameters.html
    - shard-skl:          NOTRUN -> [SKIP][97] ([fdo#109271] / [i915#2437])
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@kms_writeback@writeback-invalid-parameters.html
    - shard-kbl:          NOTRUN -> [SKIP][98] ([fdo#109271] / [i915#2437])
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_writeback@writeback-invalid-parameters.html
    - shard-glk:          NOTRUN -> [SKIP][99] ([fdo#109271] / [i915#2437])
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk1/igt@kms_writeback@writeback-invalid-parameters.html
    - shard-iclb:         NOTRUN -> [SKIP][100] ([i915#2437])
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@kms_writeback@writeback-invalid-parameters.html

  * igt@prime_nv_api@i915_nv_import_twice_check_flink_name:
    - shard-iclb:         NOTRUN -> [SKIP][101] ([fdo#109291]) +1 similar issue
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb5/igt@prime_nv_api@i915_nv_import_twice_check_flink_name.html
    - shard-tglb:         NOTRUN -> [SKIP][102] ([fdo#109291]) +1 similar issue
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb3/igt@prime_nv_api@i915_nv_import_twice_check_flink_name.html

  * igt@sysfs_clients@recycle:
    - shard-snb:          [PASS][103] -> [FAIL][104] ([i915#3028])
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-snb2/igt@sysfs_clients@recycle.html
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-snb7/igt@sysfs_clients@recycle.html

  * igt@sysfs_clients@sema-10@rcs0:
    - shard-skl:          NOTRUN -> [SKIP][105] ([fdo#109271] / [i915#3026]) +2 similar issues
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@sysfs_clients@sema-10@rcs0.html

  * igt@sysfs_clients@sema-10@vcs0:
    - shard-kbl:          NOTRUN -> [SKIP][106] ([fdo#109271] / [i915#3026]) +3 similar issues
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@sysfs_clients@sema-10@vcs0.html

  
#### Possible fixes ####

  * igt@gem_ctx_persistence@replace@vcs0:
    - shard-iclb:         [FAIL][107] ([i915#2410]) -> [PASS][108]
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-iclb3/igt@gem_ctx_persistence@replace@vcs0.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb3/igt@gem_ctx_persistence@replace@vcs0.html

  * igt@gem_exec_balancer@hang:
    - shard-iclb:         [INCOMPLETE][109] ([i915#1895] / [i915#3031]) -> [PASS][110]
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-iclb2/igt@gem_exec_balancer@hang.html
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-iclb8/igt@gem_exec_balancer@hang.html

  * igt@gem_exec_fair@basic-flow@rcs0:
    - shard-tglb:         [FAIL][111] ([i915#2842]) -> [PASS][112] +1 similar issue
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-tglb2/igt@gem_exec_fair@basic-flow@rcs0.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb1/igt@gem_exec_fair@basic-flow@rcs0.html

  * igt@gem_exec_fair@basic-none@vcs0:
    - shard-kbl:          [FAIL][113] ([i915#2842]) -> [PASS][114] +3 similar issues
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-kbl4/igt@gem_exec_fair@basic-none@vcs0.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl4/igt@gem_exec_fair@basic-none@vcs0.html
    - shard-apl:          [FAIL][115] ([i915#2842]) -> [PASS][116]
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-apl8/igt@gem_exec_fair@basic-none@vcs0.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl3/igt@gem_exec_fair@basic-none@vcs0.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-glk:          [FAIL][117] ([i915#2842]) -> [PASS][118] +1 similar issue
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk3/igt@gem_exec_fair@basic-throttle@rcs0.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk4/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_schedule@u-fairslice@vcs0:
    - shard-skl:          [DMESG-WARN][119] ([i915#1610] / [i915#2803]) -> [PASS][120]
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl9/igt@gem_exec_schedule@u-fairslice@vcs0.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl9/igt@gem_exec_schedule@u-fairslice@vcs0.html

  * igt@gem_exec_schedule@u-fairslice@vcs1:
    - shard-kbl:          [DMESG-WARN][121] ([i915#1610] / [i915#2803]) -> [PASS][122]
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-kbl4/igt@gem_exec_schedule@u-fairslice@vcs1.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@gem_exec_schedule@u-fairslice@vcs1.html

  * igt@gem_exec_whisper@basic-fds-forked:
    - shard-glk:          [DMESG-WARN][123] ([i915#118] / [i915#95]) -> [PASS][124] +1 similar issue
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk8/igt@gem_exec_whisper@basic-fds-forked.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk8/igt@gem_exec_whisper@basic-fds-forked.html

  * igt@gem_workarounds@suspend-resume-context:
    - shard-apl:          [INCOMPLETE][125] -> [PASS][126]
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-apl3/igt@gem_workarounds@suspend-resume-context.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl7/igt@gem_workarounds@suspend-resume-context.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-skl:          [DMESG-WARN][127] ([i915#1436] / [i915#716]) -> [PASS][128]
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl6/igt@gen9_exec_parse@allowed-single.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl8/igt@gen9_exec_parse@allowed-single.html

  * igt@i915_module_load@reload-with-fault-injection:
    - shard-skl:          [DMESG-WARN][129] ([i915#1982]) -> [PASS][130] +1 similar issue
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl5/igt@i915_module_load@reload-with-fault-injection.html
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl2/igt@i915_module_load@reload-with-fault-injection.html

  * igt@kms_async_flips@test-time-stamp:
    - shard-tglb:         [FAIL][131] ([i915#2574]) -> [PASS][132]
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-tglb1/igt@kms_async_flips@test-time-stamp.html
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb5/igt@kms_async_flips@test-time-stamp.html

  * igt@kms_atomic_transition@plane-all-transition-nonblocking@dp-1-pipe-b:
    - shard-kbl:          [FAIL][133] ([i915#3168]) -> [PASS][134]
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-kbl3/igt@kms_atomic_transition@plane-all-transition-nonblocking@dp-1-pipe-b.html
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-kbl7/igt@kms_atomic_transition@plane-all-transition-nonblocking@dp-1-pipe-b.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-tglb:         [FAIL][135] ([i915#2598]) -> [PASS][136]
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-tglb7/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-tglb8/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@c-dp1:
    - shard-apl:          [DMESG-WARN][137] ([i915#180]) -> [PASS][138] +2 similar issues
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-apl6/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-apl8/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html

  * igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1:
    - shard-skl:          [FAIL][139] ([i915#2122]) -> [PASS][140] +1 similar issue
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-skl7/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-skl5/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html

  * igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a2:
    - shard-glk:          [FAIL][141] ([i915#2122]) -> [PASS][142]
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9889/shard-glk2/igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a2.html
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/shard-glk6/igt@kms_flip@plain-flip-ts-check-interruptible@c-hdmi-a2.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
    - shard-skl:          [INCOMPLETE][143] ([i915#198]) -> [PASS

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19847/index.html

[-- Attachment #1.2: Type: text/html, Size: 33528 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/7] drm/i915: Request watchdog infrastructure
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-26  0:00     ` Daniel Vetter
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:00 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: Daniel Vetter, Intel-gfx, Matthew Auld, dri-devel, Tvrtko Ursulin

On Wed, Mar 24, 2021 at 12:13:33PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Prepares the plumbing for setting request/fence expiration time. All code
> is put in place but is never activated due yet missing ability to actually
> configure the timer.
> 
> Outline of the basic operation:
> 
> A timer is started when request is ready for execution. If the request
> completes (retires) before the timer fires, timer is cancelled and nothing
> further happens.
> 
> If the timer fires request is added to a lockless list and worker queued.
> Purpose of this is twofold: a) It allows request cancellation from a more
> friendly context and b) coalesces multiple expirations into a single event
> of consuming the list.
> 
> Worker locklessly consumes the list of expired requests and cancels them
> all using previous added i915_request_cancel().
> 
> Associated timeout value is stored in rq->context.watchdog.timeout_us.
> 
> v2:
>  * Log expiration.
> 
> v3:
>  * Include more information about user timeline in the log message.
> 
> v4:
>  * Remove obsolete comment and fix formatting. (Matt)
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  4 ++
>  .../drm/i915/gt/intel_execlists_submission.h  |  2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
>  drivers/gpu/drm/i915/gt/intel_gt.h            |  2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 28 ++++++++++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |  7 +++
>  drivers/gpu/drm/i915/i915_request.c           | 52 +++++++++++++++++++
>  drivers/gpu/drm/i915/i915_request.h           |  8 +++
>  8 files changed, 106 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 0ea18c9e2aca..65a5730a4f5b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -99,6 +99,10 @@ struct intel_context {
>  #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
>  #define CONTEXT_NOPREEMPT		8
>  
> +	struct {
> +		u64 timeout_us;
> +	} watchdog;
> +
>  	u32 *lrc_reg_state;
>  	union {
>  		struct {
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> index f7bd3fccfee8..4ca9b475e252 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> @@ -6,6 +6,7 @@
>  #ifndef __INTEL_EXECLISTS_SUBMISSION_H__
>  #define __INTEL_EXECLISTS_SUBMISSION_H__
>  
> +#include <linux/llist.h>
>  #include <linux/types.h>
>  
>  struct drm_printer;
> @@ -13,6 +14,7 @@ struct drm_printer;
>  struct i915_request;
>  struct intel_context;
>  struct intel_engine_cs;
> +struct intel_gt;
>  
>  enum {
>  	INTEL_CONTEXT_SCHEDULE_IN = 0,
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index ca76f93bc03d..8d77dcbad059 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -31,6 +31,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
>  	INIT_LIST_HEAD(&gt->closed_vma);
>  	spin_lock_init(&gt->closed_lock);
>  
> +	init_llist_head(&gt->watchdog.list);
> +	INIT_WORK(&gt->watchdog.work, intel_gt_watchdog_work);
> +
>  	intel_gt_init_buffer_pool(gt);
>  	intel_gt_init_reset(gt);
>  	intel_gt_init_requests(gt);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h
> index a17bd8b3195f..7ec395cace69 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.h
> @@ -78,4 +78,6 @@ static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
>  void intel_gt_info_print(const struct intel_gt_info *info,
>  			 struct drm_printer *p);
>  
> +void intel_gt_watchdog_work(struct work_struct *work);
> +
>  #endif /* __INTEL_GT_H__ */
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> index 36ec97f79174..fbfd19b2e5f2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> @@ -8,6 +8,7 @@
>  #include "i915_drv.h" /* for_each_engine() */
>  #include "i915_request.h"
>  #include "intel_engine_heartbeat.h"
> +#include "intel_execlists_submission.h"
>  #include "intel_gt.h"
>  #include "intel_gt_pm.h"
>  #include "intel_gt_requests.h"
> @@ -242,4 +243,31 @@ void intel_gt_fini_requests(struct intel_gt *gt)
>  {
>  	/* Wait until the work is marked as finished before unloading! */
>  	cancel_delayed_work_sync(&gt->requests.retire_work);
> +
> +	flush_work(&gt->watchdog.work);
> +}
> +
> +void intel_gt_watchdog_work(struct work_struct *work)
> +{
> +	struct intel_gt *gt =
> +		container_of(work, typeof(*gt), watchdog.work);
> +	struct i915_request *rq, *rn;
> +	struct llist_node *first;
> +
> +	first = llist_del_all(&gt->watchdog.list);
> +	if (!first)
> +		return;
> +
> +	llist_for_each_entry_safe(rq, rn, first, watchdog.link) {
> +		if (!i915_request_completed(rq)) {
> +			struct dma_fence *f = &rq->fence;
> +
> +			pr_notice("Fence expiration time out i915-%s:%s:%llx!\n",
> +				  f->ops->get_driver_name(f),
> +				  f->ops->get_timeline_name(f),
> +				  f->seqno);
> +			i915_request_cancel(rq, -EINTR);
> +		}
> +		i915_request_put(rq);
> +	}
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> index 626af37c7790..d70ebcc6f19f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> @@ -8,10 +8,12 @@
>  
>  #include <linux/ktime.h>
>  #include <linux/list.h>
> +#include <linux/llist.h>
>  #include <linux/mutex.h>
>  #include <linux/notifier.h>
>  #include <linux/spinlock.h>
>  #include <linux/types.h>
> +#include <linux/workqueue.h>
>  
>  #include "uc/intel_uc.h"
>  
> @@ -62,6 +64,11 @@ struct intel_gt {
>  		struct delayed_work retire_work;
>  	} requests;
>  
> +	struct {
> +		struct llist_head list;
> +		struct work_struct work;
> +	} watchdog;
> +
>  	struct intel_wakeref wakeref;
>  	atomic_t user_wakeref;
>  
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index d1a4a3fa7425..d58052f3410c 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -321,6 +321,53 @@ static void remove_from_engine(struct i915_request *rq)
>  	__notify_execute_cb_imm(rq);
>  }
>  
> +static void __rq_init_watchdog(struct i915_request *rq)
> +{
> +	rq->watchdog.timer.function = NULL;
> +}
> +
> +static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
> +{
> +	struct i915_request *rq =
> +		container_of(hrtimer, struct i915_request, watchdog.timer);
> +	struct intel_gt *gt = rq->engine->gt;
> +
> +	if (!i915_request_completed(rq)) {
> +		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
> +			schedule_work(&gt->watchdog.work);
> +	} else {
> +		i915_request_put(rq);
> +	}
> +
> +	return HRTIMER_NORESTART;
> +}
> +
> +static void __rq_arm_watchdog(struct i915_request *rq)
> +{
> +	struct i915_request_watchdog *wdg = &rq->watchdog;
> +	struct intel_context *ce = rq->context;
> +
> +	if (!ce->watchdog.timeout_us)
> +		return;
> +
> +	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	wdg->timer.function = __rq_watchdog_expired;
> +	hrtimer_start_range_ns(&wdg->timer,
> +			       ns_to_ktime(ce->watchdog.timeout_us *
> +					   NSEC_PER_USEC),
> +			       NSEC_PER_MSEC,
> +			       HRTIMER_MODE_REL);
> +	i915_request_get(rq);

Shouldn't we grab the new reference before we arm the timer? Either way
since fairly academic I went ahead and applied, but if you agree pls do a
follow up patch.
-Daniel

> +}
> +
> +static void __rq_cancel_watchdog(struct i915_request *rq)
> +{
> +	struct i915_request_watchdog *wdg = &rq->watchdog;
> +
> +	if (wdg->timer.function && hrtimer_try_to_cancel(&wdg->timer) > 0)
> +		i915_request_put(rq);
> +}
> +
>  bool i915_request_retire(struct i915_request *rq)
>  {
>  	if (!__i915_request_is_complete(rq))
> @@ -332,6 +379,8 @@ bool i915_request_retire(struct i915_request *rq)
>  	trace_i915_request_retire(rq);
>  	i915_request_mark_complete(rq);
>  
> +	__rq_cancel_watchdog(rq);
> +
>  	/*
>  	 * We know the GPU must have read the request to have
>  	 * sent us the seqno + interrupt, so use the position
> @@ -692,6 +741,8 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  
>  		if (unlikely(fence->error))
>  			i915_request_set_error_once(request, fence->error);
> +		else
> +			__rq_arm_watchdog(request);
>  
>  		/*
>  		 * We need to serialize use of the submit_request() callback
> @@ -879,6 +930,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
>  
>  	/* No zalloc, everything must be cleared after use */
>  	rq->batch = NULL;
> +	__rq_init_watchdog(rq);
>  	GEM_BUG_ON(rq->capture_list);
>  	GEM_BUG_ON(!llist_empty(&rq->execute_cb));
>  
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 2dea55ea69e1..5e0946992f1a 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -26,7 +26,9 @@
>  #define I915_REQUEST_H
>  
>  #include <linux/dma-fence.h>
> +#include <linux/hrtimer.h>
>  #include <linux/irq_work.h>
> +#include <linux/llist.h>
>  #include <linux/lockdep.h>
>  
>  #include "gem/i915_gem_context_types.h"
> @@ -289,6 +291,12 @@ struct i915_request {
>  	/** timeline->request entry for this request */
>  	struct list_head link;
>  
> +	/** Watchdog support fields. */
> +	struct i915_request_watchdog {
> +		struct llist_node link;
> +		struct hrtimer timer;
> +	} watchdog;
> +
>  	I915_SELFTEST_DECLARE(struct {
>  		struct list_head link;
>  		unsigned long delay;
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 5/7] drm/i915: Request watchdog infrastructure
@ 2021-03-26  0:00     ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:00 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, Intel-gfx, Matthew Auld, dri-devel

On Wed, Mar 24, 2021 at 12:13:33PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Prepares the plumbing for setting request/fence expiration time. All code
> is put in place but is never activated due yet missing ability to actually
> configure the timer.
> 
> Outline of the basic operation:
> 
> A timer is started when request is ready for execution. If the request
> completes (retires) before the timer fires, timer is cancelled and nothing
> further happens.
> 
> If the timer fires request is added to a lockless list and worker queued.
> Purpose of this is twofold: a) It allows request cancellation from a more
> friendly context and b) coalesces multiple expirations into a single event
> of consuming the list.
> 
> Worker locklessly consumes the list of expired requests and cancels them
> all using previous added i915_request_cancel().
> 
> Associated timeout value is stored in rq->context.watchdog.timeout_us.
> 
> v2:
>  * Log expiration.
> 
> v3:
>  * Include more information about user timeline in the log message.
> 
> v4:
>  * Remove obsolete comment and fix formatting. (Matt)
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context_types.h |  4 ++
>  .../drm/i915/gt/intel_execlists_submission.h  |  2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |  3 ++
>  drivers/gpu/drm/i915/gt/intel_gt.h            |  2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   | 28 ++++++++++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |  7 +++
>  drivers/gpu/drm/i915/i915_request.c           | 52 +++++++++++++++++++
>  drivers/gpu/drm/i915/i915_request.h           |  8 +++
>  8 files changed, 106 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> index 0ea18c9e2aca..65a5730a4f5b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> @@ -99,6 +99,10 @@ struct intel_context {
>  #define CONTEXT_FORCE_SINGLE_SUBMISSION	7
>  #define CONTEXT_NOPREEMPT		8
>  
> +	struct {
> +		u64 timeout_us;
> +	} watchdog;
> +
>  	u32 *lrc_reg_state;
>  	union {
>  		struct {
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> index f7bd3fccfee8..4ca9b475e252 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> @@ -6,6 +6,7 @@
>  #ifndef __INTEL_EXECLISTS_SUBMISSION_H__
>  #define __INTEL_EXECLISTS_SUBMISSION_H__
>  
> +#include <linux/llist.h>
>  #include <linux/types.h>
>  
>  struct drm_printer;
> @@ -13,6 +14,7 @@ struct drm_printer;
>  struct i915_request;
>  struct intel_context;
>  struct intel_engine_cs;
> +struct intel_gt;
>  
>  enum {
>  	INTEL_CONTEXT_SCHEDULE_IN = 0,
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index ca76f93bc03d..8d77dcbad059 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -31,6 +31,9 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915)
>  	INIT_LIST_HEAD(&gt->closed_vma);
>  	spin_lock_init(&gt->closed_lock);
>  
> +	init_llist_head(&gt->watchdog.list);
> +	INIT_WORK(&gt->watchdog.work, intel_gt_watchdog_work);
> +
>  	intel_gt_init_buffer_pool(gt);
>  	intel_gt_init_reset(gt);
>  	intel_gt_init_requests(gt);
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h
> index a17bd8b3195f..7ec395cace69 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.h
> @@ -78,4 +78,6 @@ static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
>  void intel_gt_info_print(const struct intel_gt_info *info,
>  			 struct drm_printer *p);
>  
> +void intel_gt_watchdog_work(struct work_struct *work);
> +
>  #endif /* __INTEL_GT_H__ */
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> index 36ec97f79174..fbfd19b2e5f2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> @@ -8,6 +8,7 @@
>  #include "i915_drv.h" /* for_each_engine() */
>  #include "i915_request.h"
>  #include "intel_engine_heartbeat.h"
> +#include "intel_execlists_submission.h"
>  #include "intel_gt.h"
>  #include "intel_gt_pm.h"
>  #include "intel_gt_requests.h"
> @@ -242,4 +243,31 @@ void intel_gt_fini_requests(struct intel_gt *gt)
>  {
>  	/* Wait until the work is marked as finished before unloading! */
>  	cancel_delayed_work_sync(&gt->requests.retire_work);
> +
> +	flush_work(&gt->watchdog.work);
> +}
> +
> +void intel_gt_watchdog_work(struct work_struct *work)
> +{
> +	struct intel_gt *gt =
> +		container_of(work, typeof(*gt), watchdog.work);
> +	struct i915_request *rq, *rn;
> +	struct llist_node *first;
> +
> +	first = llist_del_all(&gt->watchdog.list);
> +	if (!first)
> +		return;
> +
> +	llist_for_each_entry_safe(rq, rn, first, watchdog.link) {
> +		if (!i915_request_completed(rq)) {
> +			struct dma_fence *f = &rq->fence;
> +
> +			pr_notice("Fence expiration time out i915-%s:%s:%llx!\n",
> +				  f->ops->get_driver_name(f),
> +				  f->ops->get_timeline_name(f),
> +				  f->seqno);
> +			i915_request_cancel(rq, -EINTR);
> +		}
> +		i915_request_put(rq);
> +	}
>  }
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> index 626af37c7790..d70ebcc6f19f 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
> @@ -8,10 +8,12 @@
>  
>  #include <linux/ktime.h>
>  #include <linux/list.h>
> +#include <linux/llist.h>
>  #include <linux/mutex.h>
>  #include <linux/notifier.h>
>  #include <linux/spinlock.h>
>  #include <linux/types.h>
> +#include <linux/workqueue.h>
>  
>  #include "uc/intel_uc.h"
>  
> @@ -62,6 +64,11 @@ struct intel_gt {
>  		struct delayed_work retire_work;
>  	} requests;
>  
> +	struct {
> +		struct llist_head list;
> +		struct work_struct work;
> +	} watchdog;
> +
>  	struct intel_wakeref wakeref;
>  	atomic_t user_wakeref;
>  
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index d1a4a3fa7425..d58052f3410c 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -321,6 +321,53 @@ static void remove_from_engine(struct i915_request *rq)
>  	__notify_execute_cb_imm(rq);
>  }
>  
> +static void __rq_init_watchdog(struct i915_request *rq)
> +{
> +	rq->watchdog.timer.function = NULL;
> +}
> +
> +static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
> +{
> +	struct i915_request *rq =
> +		container_of(hrtimer, struct i915_request, watchdog.timer);
> +	struct intel_gt *gt = rq->engine->gt;
> +
> +	if (!i915_request_completed(rq)) {
> +		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
> +			schedule_work(&gt->watchdog.work);
> +	} else {
> +		i915_request_put(rq);
> +	}
> +
> +	return HRTIMER_NORESTART;
> +}
> +
> +static void __rq_arm_watchdog(struct i915_request *rq)
> +{
> +	struct i915_request_watchdog *wdg = &rq->watchdog;
> +	struct intel_context *ce = rq->context;
> +
> +	if (!ce->watchdog.timeout_us)
> +		return;
> +
> +	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	wdg->timer.function = __rq_watchdog_expired;
> +	hrtimer_start_range_ns(&wdg->timer,
> +			       ns_to_ktime(ce->watchdog.timeout_us *
> +					   NSEC_PER_USEC),
> +			       NSEC_PER_MSEC,
> +			       HRTIMER_MODE_REL);
> +	i915_request_get(rq);

Shouldn't we grab the new reference before we arm the timer? Either way
since fairly academic I went ahead and applied, but if you agree pls do a
follow up patch.
-Daniel

> +}
> +
> +static void __rq_cancel_watchdog(struct i915_request *rq)
> +{
> +	struct i915_request_watchdog *wdg = &rq->watchdog;
> +
> +	if (wdg->timer.function && hrtimer_try_to_cancel(&wdg->timer) > 0)
> +		i915_request_put(rq);
> +}
> +
>  bool i915_request_retire(struct i915_request *rq)
>  {
>  	if (!__i915_request_is_complete(rq))
> @@ -332,6 +379,8 @@ bool i915_request_retire(struct i915_request *rq)
>  	trace_i915_request_retire(rq);
>  	i915_request_mark_complete(rq);
>  
> +	__rq_cancel_watchdog(rq);
> +
>  	/*
>  	 * We know the GPU must have read the request to have
>  	 * sent us the seqno + interrupt, so use the position
> @@ -692,6 +741,8 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  
>  		if (unlikely(fence->error))
>  			i915_request_set_error_once(request, fence->error);
> +		else
> +			__rq_arm_watchdog(request);
>  
>  		/*
>  		 * We need to serialize use of the submit_request() callback
> @@ -879,6 +930,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
>  
>  	/* No zalloc, everything must be cleared after use */
>  	rq->batch = NULL;
> +	__rq_init_watchdog(rq);
>  	GEM_BUG_ON(rq->capture_list);
>  	GEM_BUG_ON(!llist_empty(&rq->execute_cb));
>  
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index 2dea55ea69e1..5e0946992f1a 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -26,7 +26,9 @@
>  #define I915_REQUEST_H
>  
>  #include <linux/dma-fence.h>
> +#include <linux/hrtimer.h>
>  #include <linux/irq_work.h>
> +#include <linux/llist.h>
>  #include <linux/lockdep.h>
>  
>  #include "gem/i915_gem_context_types.h"
> @@ -289,6 +291,12 @@ struct i915_request {
>  	/** timeline->request entry for this request */
>  	struct list_head link;
>  
> +	/** Watchdog support fields. */
> +	struct i915_request_watchdog {
> +		struct llist_node link;
> +		struct hrtimer timer;
> +	} watchdog;
> +
>  	I915_SELFTEST_DECLARE(struct {
>  		struct list_head link;
>  		unsigned long delay;
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 3/7] drm/i915: Restrict sentinel requests further
  2021-03-24 15:25     ` [Intel-gfx] " Matthew Auld
@ 2021-03-26  0:01       ` Daniel Vetter
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:01 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Tvrtko Ursulin, Intel Graphics Development, ML dri-devel

On Wed, Mar 24, 2021 at 03:25:55PM +0000, Matthew Auld wrote:
> On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
> >
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > Disallow sentinel requests follow previous sentinels to make request
> > cancellation work better when faced with a chain of requests which have
> > all been marked as in error.
> >
> > Because in cases where we end up with a stream of cancelled requests we
> > want to turn of request coalescing so they each will get individually
> 
> turn off

Fixed while applying.
-Daniel

> 
> > skipped by the execlists_schedule_in (which is called per ELSP port, not
> > per request).
> >
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 3/7] drm/i915: Restrict sentinel requests further
@ 2021-03-26  0:01       ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:01 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Intel Graphics Development, ML dri-devel

On Wed, Mar 24, 2021 at 03:25:55PM +0000, Matthew Auld wrote:
> On Wed, 24 Mar 2021 at 12:14, Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
> >
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > Disallow sentinel requests follow previous sentinels to make request
> > cancellation work better when faced with a chain of requests which have
> > all been marked as in error.
> >
> > Because in cases where we end up with a stream of cancelled requests we
> > want to turn of request coalescing so they each will get individually
> 
> turn off

Fixed while applying.
-Daniel

> 
> > skipped by the execlists_schedule_in (which is called per ELSP port, not
> > per request).
> >
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam
  2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-26  0:25     ` Daniel Vetter
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, Intel-gfx, dri-devel, Tvrtko Ursulin

On Wed, Mar 24, 2021 at 12:13:35PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Module parameter is added (request_timeout_ms) to allow configuring the
> default request/fence expiry.
> 
> Default value is inherited from CONFIG_DRM_I915_REQUEST_TIMEOUT.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Entire series merged, thanks for the patches.
-Daniel

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++--
>  drivers/gpu/drm/i915/i915_params.c          | 5 +++++
>  drivers/gpu/drm/i915/i915_params.h          | 1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 33ff1a6a7724..0e8f0476e01f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -845,11 +845,12 @@ static void __set_default_fence_expiry(struct i915_gem_context *ctx)
>  	struct drm_i915_private *i915 = ctx->i915;
>  	int ret;
>  
> -	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT) ||
> +	    !i915->params.request_timeout_ms)
>  		return;
>  
>  	/* Default expiry for user fences. */
> -	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
> +	ret = __set_watchdog(ctx, i915->params.request_timeout_ms * 1000);
>  	if (ret)
>  		drm_notice(&i915->drm,
>  			   "Failed to configure default fence expiry! (%d)",
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 6939634e56ed..0320878d96b0 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -197,6 +197,11 @@ i915_param_named_unsafe(fake_lmem_start, ulong, 0400,
>  	"Fake LMEM start offset (default: 0)");
>  #endif
>  
> +#if CONFIG_DRM_I915_REQUEST_TIMEOUT
> +i915_param_named_unsafe(request_timeout_ms, uint, 0600,
> +			"Default request/fence/batch buffer expiration timeout.");
> +#endif
> +
>  static __always_inline void _print_param(struct drm_printer *p,
>  					 const char *name,
>  					 const char *type,
> diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
> index 48f47e44e848..34ebb0662547 100644
> --- a/drivers/gpu/drm/i915/i915_params.h
> +++ b/drivers/gpu/drm/i915/i915_params.h
> @@ -72,6 +72,7 @@ struct drm_printer;
>  	param(int, enable_dpcd_backlight, -1, 0600) \
>  	param(char *, force_probe, CONFIG_DRM_I915_FORCE_PROBE, 0400) \
>  	param(unsigned long, fake_lmem_start, 0, 0400) \
> +	param(unsigned int, request_timeout_ms, CONFIG_DRM_I915_REQUEST_TIMEOUT, 0600) \
>  	/* leave bools at the end to not create holes */ \
>  	param(bool, enable_hangcheck, true, 0600) \
>  	param(bool, load_detect_test, false, 0600) \
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam
@ 2021-03-26  0:25     ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  0:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Daniel Vetter, Intel-gfx, dri-devel

On Wed, Mar 24, 2021 at 12:13:35PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Module parameter is added (request_timeout_ms) to allow configuring the
> default request/fence expiry.
> 
> Default value is inherited from CONFIG_DRM_I915_REQUEST_TIMEOUT.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Entire series merged, thanks for the patches.
-Daniel

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 5 +++--
>  drivers/gpu/drm/i915/i915_params.c          | 5 +++++
>  drivers/gpu/drm/i915/i915_params.h          | 1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 33ff1a6a7724..0e8f0476e01f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -845,11 +845,12 @@ static void __set_default_fence_expiry(struct i915_gem_context *ctx)
>  	struct drm_i915_private *i915 = ctx->i915;
>  	int ret;
>  
> -	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT))
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_REQUEST_TIMEOUT) ||
> +	    !i915->params.request_timeout_ms)
>  		return;
>  
>  	/* Default expiry for user fences. */
> -	ret = __set_watchdog(ctx, CONFIG_DRM_I915_REQUEST_TIMEOUT * 1000);
> +	ret = __set_watchdog(ctx, i915->params.request_timeout_ms * 1000);
>  	if (ret)
>  		drm_notice(&i915->drm,
>  			   "Failed to configure default fence expiry! (%d)",
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 6939634e56ed..0320878d96b0 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -197,6 +197,11 @@ i915_param_named_unsafe(fake_lmem_start, ulong, 0400,
>  	"Fake LMEM start offset (default: 0)");
>  #endif
>  
> +#if CONFIG_DRM_I915_REQUEST_TIMEOUT
> +i915_param_named_unsafe(request_timeout_ms, uint, 0600,
> +			"Default request/fence/batch buffer expiration timeout.");
> +#endif
> +
>  static __always_inline void _print_param(struct drm_printer *p,
>  					 const char *name,
>  					 const char *type,
> diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
> index 48f47e44e848..34ebb0662547 100644
> --- a/drivers/gpu/drm/i915/i915_params.h
> +++ b/drivers/gpu/drm/i915/i915_params.h
> @@ -72,6 +72,7 @@ struct drm_printer;
>  	param(int, enable_dpcd_backlight, -1, 0600) \
>  	param(char *, force_probe, CONFIG_DRM_I915_FORCE_PROBE, 0400) \
>  	param(unsigned long, fake_lmem_start, 0, 0400) \
> +	param(unsigned int, request_timeout_ms, CONFIG_DRM_I915_REQUEST_TIMEOUT, 0600) \
>  	/* leave bools at the end to not create holes */ \
>  	param(bool, enable_hangcheck, true, 0600) \
>  	param(bool, load_detect_test, false, 0600) \
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
  2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
@ 2021-03-26  9:10   ` Daniel Vetter
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  9:10 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx, dri-devel

On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> post of a somewhat controversial feature, now upgraded to patch status.
> 
> I quote the "watchdog" becuase in classical sense watchdog would allow userspace
> to ping it and so remain alive.
> 
> I quote "restoring hangcheck" because this series, contrary to the old
> hangcheck, is not looking at whether the workload is making any progress from
> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> But apart from being prone to both false negatives and false positives I can't
> remember that myself.)
> 
> Short version - ask is to fail any user submissions after a set time period. In
> this RFC that time is twelve seconds.
> 
> Time counts from the moment user submission is "runnable" (implicit and explicit
> dependencies have been cleared) and keeps counting regardless of the GPU
> contetion caused by other users of the system.
> 
> So semantics are really a bit weak, but again, I understand this is really
> really wanted by the DRM core even if I am not convinced it is a good idea.
> 
> There are some dangers with doing this - text borrowed from a patch in the
> series:
> 
>   This can have an effect that workloads which used to work fine will
>   suddenly start failing. Even workloads comprised of short batches but in
>   long dependency chains can be terminated.
> 
>   And becuase of lack of agreement on usefulness and safety of fence error
>   propagation this partial execution can be invisible to userspace even if
>   it is "listening" to returned fence status.
> 
>   Another interaction is with hangcheck where care needs to be taken timeout
>   is not set lower or close to three times the heartbeat interval. Otherwise
>   a hang in any application can cause complete termination of all
>   submissions from unrelated clients. Any users modifying the per engine
>   heartbeat intervals therefore need to be aware of this potential denial of
>   service to avoid inadvertently enabling it.
> 
>   Given all this I am personally not convinced the scheme is a good idea.
>   Intuitively it feels object importers would be better positioned to
>   enforce the time they are willing to wait for something to complete.
> 
> v2:
>  * Dropped context param.
>  * Improved commit messages and Kconfig text.
> 
> v3:
>  * Log timeouts.
>  * Bump timeout to 20s to see if it helps Tigerlake.

I think 20s is a bit much, and seems like problem is still there in igt. I
think we need look at that and figure out what to do with it. And then go
back down with the timeout somewhat again since 20s is quite a long time.
Irrespective of all the additional gaps/opens around watchdog timeout.
-Daniel

>  * Fix sentinel assert.
> 
> v4:
>  * A round of review feedback applied.
> 
> Chris Wilson (1):
>   drm/i915: Individual request cancellation
> 
> Tvrtko Ursulin (6):
>   drm/i915: Extract active lookup engine to a helper
>   drm/i915: Restrict sentinel requests further
>   drm/i915: Handle async cancellation in sentinel assert
>   drm/i915: Request watchdog infrastructure
>   drm/i915: Fail too long user submissions by default
>   drm/i915: Allow configuring default request expiry via modparam
> 
>  drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>  drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>  .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>  .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>  drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>  drivers/gpu/drm/i915/i915_params.c            |   5 +
>  drivers/gpu/drm/i915/i915_params.h            |   1 +
>  drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>  drivers/gpu/drm/i915/i915_request.h           |  16 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>  17 files changed, 479 insertions(+), 45 deletions(-)
> 
> -- 
> 2.27.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
@ 2021-03-26  9:10   ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-03-26  9:10 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx, dri-devel

On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> post of a somewhat controversial feature, now upgraded to patch status.
> 
> I quote the "watchdog" becuase in classical sense watchdog would allow userspace
> to ping it and so remain alive.
> 
> I quote "restoring hangcheck" because this series, contrary to the old
> hangcheck, is not looking at whether the workload is making any progress from
> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> But apart from being prone to both false negatives and false positives I can't
> remember that myself.)
> 
> Short version - ask is to fail any user submissions after a set time period. In
> this RFC that time is twelve seconds.
> 
> Time counts from the moment user submission is "runnable" (implicit and explicit
> dependencies have been cleared) and keeps counting regardless of the GPU
> contetion caused by other users of the system.
> 
> So semantics are really a bit weak, but again, I understand this is really
> really wanted by the DRM core even if I am not convinced it is a good idea.
> 
> There are some dangers with doing this - text borrowed from a patch in the
> series:
> 
>   This can have an effect that workloads which used to work fine will
>   suddenly start failing. Even workloads comprised of short batches but in
>   long dependency chains can be terminated.
> 
>   And becuase of lack of agreement on usefulness and safety of fence error
>   propagation this partial execution can be invisible to userspace even if
>   it is "listening" to returned fence status.
> 
>   Another interaction is with hangcheck where care needs to be taken timeout
>   is not set lower or close to three times the heartbeat interval. Otherwise
>   a hang in any application can cause complete termination of all
>   submissions from unrelated clients. Any users modifying the per engine
>   heartbeat intervals therefore need to be aware of this potential denial of
>   service to avoid inadvertently enabling it.
> 
>   Given all this I am personally not convinced the scheme is a good idea.
>   Intuitively it feels object importers would be better positioned to
>   enforce the time they are willing to wait for something to complete.
> 
> v2:
>  * Dropped context param.
>  * Improved commit messages and Kconfig text.
> 
> v3:
>  * Log timeouts.
>  * Bump timeout to 20s to see if it helps Tigerlake.

I think 20s is a bit much, and seems like problem is still there in igt. I
think we need look at that and figure out what to do with it. And then go
back down with the timeout somewhat again since 20s is quite a long time.
Irrespective of all the additional gaps/opens around watchdog timeout.
-Daniel

>  * Fix sentinel assert.
> 
> v4:
>  * A round of review feedback applied.
> 
> Chris Wilson (1):
>   drm/i915: Individual request cancellation
> 
> Tvrtko Ursulin (6):
>   drm/i915: Extract active lookup engine to a helper
>   drm/i915: Restrict sentinel requests further
>   drm/i915: Handle async cancellation in sentinel assert
>   drm/i915: Request watchdog infrastructure
>   drm/i915: Fail too long user submissions by default
>   drm/i915: Allow configuring default request expiry via modparam
> 
>  drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>  drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>  .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>  .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>  drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>  drivers/gpu/drm/i915/i915_params.c            |   5 +
>  drivers/gpu/drm/i915/i915_params.h            |   1 +
>  drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>  drivers/gpu/drm/i915/i915_request.h           |  16 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>  17 files changed, 479 insertions(+), 45 deletions(-)
> 
> -- 
> 2.27.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
  2021-03-26  9:10   ` Daniel Vetter
@ 2021-03-26 10:31     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-26 10:31 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-gfx, dri-devel


On 26/03/2021 09:10, Daniel Vetter wrote:
> On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
>> post of a somewhat controversial feature, now upgraded to patch status.
>>
>> I quote the "watchdog" becuase in classical sense watchdog would allow userspace
>> to ping it and so remain alive.
>>
>> I quote "restoring hangcheck" because this series, contrary to the old
>> hangcheck, is not looking at whether the workload is making any progress from
>> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
>> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
>> But apart from being prone to both false negatives and false positives I can't
>> remember that myself.)
>>
>> Short version - ask is to fail any user submissions after a set time period. In
>> this RFC that time is twelve seconds.
>>
>> Time counts from the moment user submission is "runnable" (implicit and explicit
>> dependencies have been cleared) and keeps counting regardless of the GPU
>> contetion caused by other users of the system.
>>
>> So semantics are really a bit weak, but again, I understand this is really
>> really wanted by the DRM core even if I am not convinced it is a good idea.
>>
>> There are some dangers with doing this - text borrowed from a patch in the
>> series:
>>
>>    This can have an effect that workloads which used to work fine will
>>    suddenly start failing. Even workloads comprised of short batches but in
>>    long dependency chains can be terminated.
>>
>>    And becuase of lack of agreement on usefulness and safety of fence error
>>    propagation this partial execution can be invisible to userspace even if
>>    it is "listening" to returned fence status.
>>
>>    Another interaction is with hangcheck where care needs to be taken timeout
>>    is not set lower or close to three times the heartbeat interval. Otherwise
>>    a hang in any application can cause complete termination of all
>>    submissions from unrelated clients. Any users modifying the per engine
>>    heartbeat intervals therefore need to be aware of this potential denial of
>>    service to avoid inadvertently enabling it.
>>
>>    Given all this I am personally not convinced the scheme is a good idea.
>>    Intuitively it feels object importers would be better positioned to
>>    enforce the time they are willing to wait for something to complete.
>>
>> v2:
>>   * Dropped context param.
>>   * Improved commit messages and Kconfig text.
>>
>> v3:
>>   * Log timeouts.
>>   * Bump timeout to 20s to see if it helps Tigerlake.
> 
> I think 20s is a bit much, and seems like problem is still there in igt. I
> think we need look at that and figure out what to do with it. And then go
> back down with the timeout somewhat again since 20s is quite a long time.
> Irrespective of all the additional gaps/opens around watchdog timeout.

1)

The relationship with the hearbeat is the first issue. There we have 3x 
heartbeat period (each rounded to full second) before sending a 
high-prio pulse which can cause a preempt timeout and hence a 
reset/kicking out of a non-compliant request.

Defaults for those values mean default expiry shouldn't be lower than 3x 
rounded hearbeat interval + preempt timeout, currently ~9.75s. In 
practice even 12s which I tried initially was too aggressive due slacks 
on some platforms.

2)

20s seems to work apart that it shows the general regression 
unconditional default expiry adds. Either some existing IGTs which 
create long runnable chains, or the far-fence test which explicitly 
demonstrates this. AFAIK, and apart from the can_merge_rq yet 
unexplained oops, this is the only class of IGT failures which can appear.

So you could tweak it lower, if you also decide to make real hang 
detection stricter. But doing that also worsens the regression with 
loaded systems.

I only can have a large shrug/dontknow here since I wish we went more 
towards my suggestion of emulating setrlimit(RLIMIT_CPU). Meaning at 
least going with GPU time instead of elapsed time and possibly even 
leaving the policy of setting it to sysadmins. That would fit much 
better with our hangcheck, but, doesn't fit the drm core mandate.. hence 
I really don't know.

Regards,

Tvrtko

> -Daniel
> 
>>   * Fix sentinel assert.
>>
>> v4:
>>   * A round of review feedback applied.
>>
>> Chris Wilson (1):
>>    drm/i915: Individual request cancellation
>>
>> Tvrtko Ursulin (6):
>>    drm/i915: Extract active lookup engine to a helper
>>    drm/i915: Restrict sentinel requests further
>>    drm/i915: Handle async cancellation in sentinel assert
>>    drm/i915: Request watchdog infrastructure
>>    drm/i915: Fail too long user submissions by default
>>    drm/i915: Allow configuring default request expiry via modparam
>>
>>   drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>>   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>>   .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>>   drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>>   drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>>   .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>>   .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>>   drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>>   drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>>   drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>>   drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>>   drivers/gpu/drm/i915/i915_params.c            |   5 +
>>   drivers/gpu/drm/i915/i915_params.h            |   1 +
>>   drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>>   drivers/gpu/drm/i915/i915_request.h           |  16 +-
>>   drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>>   17 files changed, 479 insertions(+), 45 deletions(-)
>>
>> -- 
>> 2.27.0
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
@ 2021-03-26 10:31     ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-26 10:31 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel-gfx, dri-devel


On 26/03/2021 09:10, Daniel Vetter wrote:
> On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
>> post of a somewhat controversial feature, now upgraded to patch status.
>>
>> I quote the "watchdog" becuase in classical sense watchdog would allow userspace
>> to ping it and so remain alive.
>>
>> I quote "restoring hangcheck" because this series, contrary to the old
>> hangcheck, is not looking at whether the workload is making any progress from
>> the kernel side either. (Although disclaimer my memory may be leaky - Daniel
>> suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
>> But apart from being prone to both false negatives and false positives I can't
>> remember that myself.)
>>
>> Short version - ask is to fail any user submissions after a set time period. In
>> this RFC that time is twelve seconds.
>>
>> Time counts from the moment user submission is "runnable" (implicit and explicit
>> dependencies have been cleared) and keeps counting regardless of the GPU
>> contetion caused by other users of the system.
>>
>> So semantics are really a bit weak, but again, I understand this is really
>> really wanted by the DRM core even if I am not convinced it is a good idea.
>>
>> There are some dangers with doing this - text borrowed from a patch in the
>> series:
>>
>>    This can have an effect that workloads which used to work fine will
>>    suddenly start failing. Even workloads comprised of short batches but in
>>    long dependency chains can be terminated.
>>
>>    And becuase of lack of agreement on usefulness and safety of fence error
>>    propagation this partial execution can be invisible to userspace even if
>>    it is "listening" to returned fence status.
>>
>>    Another interaction is with hangcheck where care needs to be taken timeout
>>    is not set lower or close to three times the heartbeat interval. Otherwise
>>    a hang in any application can cause complete termination of all
>>    submissions from unrelated clients. Any users modifying the per engine
>>    heartbeat intervals therefore need to be aware of this potential denial of
>>    service to avoid inadvertently enabling it.
>>
>>    Given all this I am personally not convinced the scheme is a good idea.
>>    Intuitively it feels object importers would be better positioned to
>>    enforce the time they are willing to wait for something to complete.
>>
>> v2:
>>   * Dropped context param.
>>   * Improved commit messages and Kconfig text.
>>
>> v3:
>>   * Log timeouts.
>>   * Bump timeout to 20s to see if it helps Tigerlake.
> 
> I think 20s is a bit much, and seems like problem is still there in igt. I
> think we need look at that and figure out what to do with it. And then go
> back down with the timeout somewhat again since 20s is quite a long time.
> Irrespective of all the additional gaps/opens around watchdog timeout.

1)

The relationship with the hearbeat is the first issue. There we have 3x 
heartbeat period (each rounded to full second) before sending a 
high-prio pulse which can cause a preempt timeout and hence a 
reset/kicking out of a non-compliant request.

Defaults for those values mean default expiry shouldn't be lower than 3x 
rounded hearbeat interval + preempt timeout, currently ~9.75s. In 
practice even 12s which I tried initially was too aggressive due slacks 
on some platforms.

2)

20s seems to work apart that it shows the general regression 
unconditional default expiry adds. Either some existing IGTs which 
create long runnable chains, or the far-fence test which explicitly 
demonstrates this. AFAIK, and apart from the can_merge_rq yet 
unexplained oops, this is the only class of IGT failures which can appear.

So you could tweak it lower, if you also decide to make real hang 
detection stricter. But doing that also worsens the regression with 
loaded systems.

I only can have a large shrug/dontknow here since I wish we went more 
towards my suggestion of emulating setrlimit(RLIMIT_CPU). Meaning at 
least going with GPU time instead of elapsed time and possibly even 
leaving the policy of setting it to sysadmins. That would fit much 
better with our hangcheck, but, doesn't fit the drm core mandate.. hence 
I really don't know.

Regards,

Tvrtko

> -Daniel
> 
>>   * Fix sentinel assert.
>>
>> v4:
>>   * A round of review feedback applied.
>>
>> Chris Wilson (1):
>>    drm/i915: Individual request cancellation
>>
>> Tvrtko Ursulin (6):
>>    drm/i915: Extract active lookup engine to a helper
>>    drm/i915: Restrict sentinel requests further
>>    drm/i915: Handle async cancellation in sentinel assert
>>    drm/i915: Request watchdog infrastructure
>>    drm/i915: Fail too long user submissions by default
>>    drm/i915: Allow configuring default request expiry via modparam
>>
>>   drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>>   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>>   .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>>   drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>>   drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>>   .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>>   .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>>   drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>>   drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>>   drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>>   drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>>   drivers/gpu/drm/i915/i915_params.c            |   5 +
>>   drivers/gpu/drm/i915/i915_params.h            |   1 +
>>   drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>>   drivers/gpu/drm/i915/i915_request.h           |  16 +-
>>   drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>>   17 files changed, 479 insertions(+), 45 deletions(-)
>>
>> -- 
>> 2.27.0
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/7] drm/i915: Request watchdog infrastructure
  2021-03-26  0:00     ` [Intel-gfx] " Daniel Vetter
@ 2021-03-26 10:32       ` Tvrtko Ursulin
  -1 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-26 10:32 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, Intel-gfx, Matthew Auld, dri-devel, Tvrtko Ursulin


On 26/03/2021 00:00, Daniel Vetter wrote:
> On Wed, Mar 24, 2021 at 12:13:33PM +0000, Tvrtko Ursulin wrote:

[snip]

>> +static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
>> +{
>> +	struct i915_request *rq =
>> +		container_of(hrtimer, struct i915_request, watchdog.timer);
>> +	struct intel_gt *gt = rq->engine->gt;
>> +
>> +	if (!i915_request_completed(rq)) {
>> +		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
>> +			schedule_work(&gt->watchdog.work);
>> +	} else {
>> +		i915_request_put(rq);
>> +	}
>> +
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +static void __rq_arm_watchdog(struct i915_request *rq)
>> +{
>> +	struct i915_request_watchdog *wdg = &rq->watchdog;
>> +	struct intel_context *ce = rq->context;
>> +
>> +	if (!ce->watchdog.timeout_us)
>> +		return;
>> +
>> +	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	wdg->timer.function = __rq_watchdog_expired;
>> +	hrtimer_start_range_ns(&wdg->timer,
>> +			       ns_to_ktime(ce->watchdog.timeout_us *
>> +					   NSEC_PER_USEC),
>> +			       NSEC_PER_MSEC,
>> +			       HRTIMER_MODE_REL);
>> +	i915_request_get(rq);
> 
> Shouldn't we grab the new reference before we arm the timer? Either way
> since fairly academic I went ahead and applied, but if you agree pls do a
> follow up patch.

Absolutely true.. my bad.

Regards,

Tvrtko
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH 5/7] drm/i915: Request watchdog infrastructure
@ 2021-03-26 10:32       ` Tvrtko Ursulin
  0 siblings, 0 replies; 42+ messages in thread
From: Tvrtko Ursulin @ 2021-03-26 10:32 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel-gfx, Matthew Auld, dri-devel


On 26/03/2021 00:00, Daniel Vetter wrote:
> On Wed, Mar 24, 2021 at 12:13:33PM +0000, Tvrtko Ursulin wrote:

[snip]

>> +static enum hrtimer_restart __rq_watchdog_expired(struct hrtimer *hrtimer)
>> +{
>> +	struct i915_request *rq =
>> +		container_of(hrtimer, struct i915_request, watchdog.timer);
>> +	struct intel_gt *gt = rq->engine->gt;
>> +
>> +	if (!i915_request_completed(rq)) {
>> +		if (llist_add(&rq->watchdog.link, &gt->watchdog.list))
>> +			schedule_work(&gt->watchdog.work);
>> +	} else {
>> +		i915_request_put(rq);
>> +	}
>> +
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +static void __rq_arm_watchdog(struct i915_request *rq)
>> +{
>> +	struct i915_request_watchdog *wdg = &rq->watchdog;
>> +	struct intel_context *ce = rq->context;
>> +
>> +	if (!ce->watchdog.timeout_us)
>> +		return;
>> +
>> +	hrtimer_init(&wdg->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	wdg->timer.function = __rq_watchdog_expired;
>> +	hrtimer_start_range_ns(&wdg->timer,
>> +			       ns_to_ktime(ce->watchdog.timeout_us *
>> +					   NSEC_PER_USEC),
>> +			       NSEC_PER_MSEC,
>> +			       HRTIMER_MODE_REL);
>> +	i915_request_get(rq);
> 
> Shouldn't we grab the new reference before we arm the timer? Either way
> since fairly academic I went ahead and applied, but if you agree pls do a
> follow up patch.

Absolutely true.. my bad.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
  2021-03-26 10:31     ` Tvrtko Ursulin
@ 2021-04-08 10:18       ` Daniel Vetter
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-04-08 10:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx, dri-devel

On Fri, Mar 26, 2021 at 10:31:10AM +0000, Tvrtko Ursulin wrote:
> 
> On 26/03/2021 09:10, Daniel Vetter wrote:
> > On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > 
> > > "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> > > post of a somewhat controversial feature, now upgraded to patch status.
> > > 
> > > I quote the "watchdog" becuase in classical sense watchdog would allow userspace
> > > to ping it and so remain alive.
> > > 
> > > I quote "restoring hangcheck" because this series, contrary to the old
> > > hangcheck, is not looking at whether the workload is making any progress from
> > > the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> > > suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> > > But apart from being prone to both false negatives and false positives I can't
> > > remember that myself.)
> > > 
> > > Short version - ask is to fail any user submissions after a set time period. In
> > > this RFC that time is twelve seconds.
> > > 
> > > Time counts from the moment user submission is "runnable" (implicit and explicit
> > > dependencies have been cleared) and keeps counting regardless of the GPU
> > > contetion caused by other users of the system.
> > > 
> > > So semantics are really a bit weak, but again, I understand this is really
> > > really wanted by the DRM core even if I am not convinced it is a good idea.
> > > 
> > > There are some dangers with doing this - text borrowed from a patch in the
> > > series:
> > > 
> > >    This can have an effect that workloads which used to work fine will
> > >    suddenly start failing. Even workloads comprised of short batches but in
> > >    long dependency chains can be terminated.
> > > 
> > >    And becuase of lack of agreement on usefulness and safety of fence error
> > >    propagation this partial execution can be invisible to userspace even if
> > >    it is "listening" to returned fence status.
> > > 
> > >    Another interaction is with hangcheck where care needs to be taken timeout
> > >    is not set lower or close to three times the heartbeat interval. Otherwise
> > >    a hang in any application can cause complete termination of all
> > >    submissions from unrelated clients. Any users modifying the per engine
> > >    heartbeat intervals therefore need to be aware of this potential denial of
> > >    service to avoid inadvertently enabling it.
> > > 
> > >    Given all this I am personally not convinced the scheme is a good idea.
> > >    Intuitively it feels object importers would be better positioned to
> > >    enforce the time they are willing to wait for something to complete.
> > > 
> > > v2:
> > >   * Dropped context param.
> > >   * Improved commit messages and Kconfig text.
> > > 
> > > v3:
> > >   * Log timeouts.
> > >   * Bump timeout to 20s to see if it helps Tigerlake.
> > 
> > I think 20s is a bit much, and seems like problem is still there in igt. I
> > think we need look at that and figure out what to do with it. And then go
> > back down with the timeout somewhat again since 20s is quite a long time.
> > Irrespective of all the additional gaps/opens around watchdog timeout.
> 
> 1)
> 
> The relationship with the hearbeat is the first issue. There we have 3x
> heartbeat period (each rounded to full second) before sending a high-prio
> pulse which can cause a preempt timeout and hence a reset/kicking out of a
> non-compliant request.
> 
> Defaults for those values mean default expiry shouldn't be lower than 3x
> rounded hearbeat interval + preempt timeout, currently ~9.75s. In practice
> even 12s which I tried initially was too aggressive due slacks on some
> platforms.

Hm, would be good to put that as a comment next to the module param, or
something like that. Maybe even a sanity check to make sure these two
values are consistent (i.e. if watchdog is less than 3.5x the heartbeat,
we complain in dmesg).

> 2)
> 
> 20s seems to work apart that it shows the general regression unconditional
> default expiry adds. Either some existing IGTs which create long runnable
> chains, or the far-fence test which explicitly demonstrates this. AFAIK, and
> apart from the can_merge_rq yet unexplained oops, this is the only class of
> IGT failures which can appear.
> 
> So you could tweak it lower, if you also decide to make real hang detection
> stricter. But doing that also worsens the regression with loaded systems.
> 
> I only can have a large shrug/dontknow here since I wish we went more
> towards my suggestion of emulating setrlimit(RLIMIT_CPU). Meaning at least
> going with GPU time instead of elapsed time and possibly even leaving the
> policy of setting it to sysadmins. That would fit much better with our
> hangcheck, but, doesn't fit the drm core mandate.. hence I really don't
> know.

The bikeshed will come back when we wire up drm/scheduler as the frontend
for guc scheduler backend. I guess we can tackle it then.
-Daniel

> 
> Regards,
> 
> Tvrtko
> 
> > -Daniel
> > 
> > >   * Fix sentinel assert.
> > > 
> > > v4:
> > >   * A round of review feedback applied.
> > > 
> > > Chris Wilson (1):
> > >    drm/i915: Individual request cancellation
> > > 
> > > Tvrtko Ursulin (6):
> > >    drm/i915: Extract active lookup engine to a helper
> > >    drm/i915: Restrict sentinel requests further
> > >    drm/i915: Handle async cancellation in sentinel assert
> > >    drm/i915: Request watchdog infrastructure
> > >    drm/i915: Fail too long user submissions by default
> > >    drm/i915: Allow configuring default request expiry via modparam
> > > 
> > >   drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
> > >   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
> > >   .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
> > >   drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
> > >   drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
> > >   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> > >   .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
> > >   .../drm/i915/gt/intel_execlists_submission.h  |   2 +
> > >   drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
> > >   drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
> > >   drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
> > >   drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
> > >   drivers/gpu/drm/i915/i915_params.c            |   5 +
> > >   drivers/gpu/drm/i915/i915_params.h            |   1 +
> > >   drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
> > >   drivers/gpu/drm/i915/i915_request.h           |  16 +-
> > >   drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
> > >   17 files changed, 479 insertions(+), 45 deletions(-)
> > > 
> > > -- 
> > > 2.27.0
> > > 
> > > _______________________________________________
> > > Intel-gfx mailing list
> > > Intel-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
@ 2021-04-08 10:18       ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2021-04-08 10:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx, dri-devel

On Fri, Mar 26, 2021 at 10:31:10AM +0000, Tvrtko Ursulin wrote:
> 
> On 26/03/2021 09:10, Daniel Vetter wrote:
> > On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > 
> > > "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> > > post of a somewhat controversial feature, now upgraded to patch status.
> > > 
> > > I quote the "watchdog" becuase in classical sense watchdog would allow userspace
> > > to ping it and so remain alive.
> > > 
> > > I quote "restoring hangcheck" because this series, contrary to the old
> > > hangcheck, is not looking at whether the workload is making any progress from
> > > the kernel side either. (Although disclaimer my memory may be leaky - Daniel
> > > suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
> > > But apart from being prone to both false negatives and false positives I can't
> > > remember that myself.)
> > > 
> > > Short version - ask is to fail any user submissions after a set time period. In
> > > this RFC that time is twelve seconds.
> > > 
> > > Time counts from the moment user submission is "runnable" (implicit and explicit
> > > dependencies have been cleared) and keeps counting regardless of the GPU
> > > contetion caused by other users of the system.
> > > 
> > > So semantics are really a bit weak, but again, I understand this is really
> > > really wanted by the DRM core even if I am not convinced it is a good idea.
> > > 
> > > There are some dangers with doing this - text borrowed from a patch in the
> > > series:
> > > 
> > >    This can have an effect that workloads which used to work fine will
> > >    suddenly start failing. Even workloads comprised of short batches but in
> > >    long dependency chains can be terminated.
> > > 
> > >    And becuase of lack of agreement on usefulness and safety of fence error
> > >    propagation this partial execution can be invisible to userspace even if
> > >    it is "listening" to returned fence status.
> > > 
> > >    Another interaction is with hangcheck where care needs to be taken timeout
> > >    is not set lower or close to three times the heartbeat interval. Otherwise
> > >    a hang in any application can cause complete termination of all
> > >    submissions from unrelated clients. Any users modifying the per engine
> > >    heartbeat intervals therefore need to be aware of this potential denial of
> > >    service to avoid inadvertently enabling it.
> > > 
> > >    Given all this I am personally not convinced the scheme is a good idea.
> > >    Intuitively it feels object importers would be better positioned to
> > >    enforce the time they are willing to wait for something to complete.
> > > 
> > > v2:
> > >   * Dropped context param.
> > >   * Improved commit messages and Kconfig text.
> > > 
> > > v3:
> > >   * Log timeouts.
> > >   * Bump timeout to 20s to see if it helps Tigerlake.
> > 
> > I think 20s is a bit much, and seems like problem is still there in igt. I
> > think we need look at that and figure out what to do with it. And then go
> > back down with the timeout somewhat again since 20s is quite a long time.
> > Irrespective of all the additional gaps/opens around watchdog timeout.
> 
> 1)
> 
> The relationship with the hearbeat is the first issue. There we have 3x
> heartbeat period (each rounded to full second) before sending a high-prio
> pulse which can cause a preempt timeout and hence a reset/kicking out of a
> non-compliant request.
> 
> Defaults for those values mean default expiry shouldn't be lower than 3x
> rounded hearbeat interval + preempt timeout, currently ~9.75s. In practice
> even 12s which I tried initially was too aggressive due slacks on some
> platforms.

Hm, would be good to put that as a comment next to the module param, or
something like that. Maybe even a sanity check to make sure these two
values are consistent (i.e. if watchdog is less than 3.5x the heartbeat,
we complain in dmesg).

> 2)
> 
> 20s seems to work apart that it shows the general regression unconditional
> default expiry adds. Either some existing IGTs which create long runnable
> chains, or the far-fence test which explicitly demonstrates this. AFAIK, and
> apart from the can_merge_rq yet unexplained oops, this is the only class of
> IGT failures which can appear.
> 
> So you could tweak it lower, if you also decide to make real hang detection
> stricter. But doing that also worsens the regression with loaded systems.
> 
> I only can have a large shrug/dontknow here since I wish we went more
> towards my suggestion of emulating setrlimit(RLIMIT_CPU). Meaning at least
> going with GPU time instead of elapsed time and possibly even leaving the
> policy of setting it to sysadmins. That would fit much better with our
> hangcheck, but, doesn't fit the drm core mandate.. hence I really don't
> know.

The bikeshed will come back when we wire up drm/scheduler as the frontend
for guc scheduler backend. I guess we can tackle it then.
-Daniel

> 
> Regards,
> 
> Tvrtko
> 
> > -Daniel
> > 
> > >   * Fix sentinel assert.
> > > 
> > > v4:
> > >   * A round of review feedback applied.
> > > 
> > > Chris Wilson (1):
> > >    drm/i915: Individual request cancellation
> > > 
> > > Tvrtko Ursulin (6):
> > >    drm/i915: Extract active lookup engine to a helper
> > >    drm/i915: Restrict sentinel requests further
> > >    drm/i915: Handle async cancellation in sentinel assert
> > >    drm/i915: Request watchdog infrastructure
> > >    drm/i915: Fail too long user submissions by default
> > >    drm/i915: Allow configuring default request expiry via modparam
> > > 
> > >   drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
> > >   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
> > >   .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
> > >   drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
> > >   drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
> > >   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> > >   .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
> > >   .../drm/i915/gt/intel_execlists_submission.h  |   2 +
> > >   drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
> > >   drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
> > >   drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
> > >   drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
> > >   drivers/gpu/drm/i915/i915_params.c            |   5 +
> > >   drivers/gpu/drm/i915/i915_params.h            |   1 +
> > >   drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
> > >   drivers/gpu/drm/i915/i915_request.h           |  16 +-
> > >   drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
> > >   17 files changed, 479 insertions(+), 45 deletions(-)
> > > 
> > > -- 
> > > 2.27.0
> > > 
> > > _______________________________________________
> > > Intel-gfx mailing list
> > > Intel-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2021-04-08 10:19 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 12:13 [PATCH v4 0/7] Default request/fence expiry + watchdog Tvrtko Ursulin
2021-03-24 12:13 ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:13 ` [PATCH 1/7] drm/i915: Extract active lookup engine to a helper Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:21   ` Matthew Auld
2021-03-24 12:21     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 2/7] drm/i915: Individual request cancellation Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 15:24   ` Matthew Auld
2021-03-24 15:24     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 3/7] drm/i915: Restrict sentinel requests further Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 15:25   ` Matthew Auld
2021-03-24 15:25     ` [Intel-gfx] " Matthew Auld
2021-03-26  0:01     ` Daniel Vetter
2021-03-26  0:01       ` Daniel Vetter
2021-03-24 12:13 ` [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 17:22   ` Matthew Auld
2021-03-24 17:22     ` Matthew Auld
2021-03-24 12:13 ` [PATCH 5/7] drm/i915: Request watchdog infrastructure Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-26  0:00   ` Daniel Vetter
2021-03-26  0:00     ` [Intel-gfx] " Daniel Vetter
2021-03-26 10:32     ` Tvrtko Ursulin
2021-03-26 10:32       ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:13 ` [PATCH 6/7] drm/i915: Fail too long user submissions by default Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-24 12:13 ` [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam Tvrtko Ursulin
2021-03-24 12:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-03-26  0:25   ` Daniel Vetter
2021-03-26  0:25     ` [Intel-gfx] " Daniel Vetter
2021-03-24 13:16 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Default request/fence expiry + watchdog (rev5) Patchwork
2021-03-24 13:21 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-03-24 13:48 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-03-24 23:29 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-03-26  9:10 ` [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog Daniel Vetter
2021-03-26  9:10   ` Daniel Vetter
2021-03-26 10:31   ` Tvrtko Ursulin
2021-03-26 10:31     ` Tvrtko Ursulin
2021-04-08 10:18     ` Daniel Vetter
2021-04-08 10:18       ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.