[Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request


* [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request
@ 2023-03-08  9:41 Andi Shyti
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

Hi,

This series of five patches fixes the issue introduced in
cf586021642d80 ("drm/i915/gt: Pipelined page migration") where,
as reported by Matt, in a chain of requests an error is reported
only if it happens in the last request.

However Chris noticed that without ensuring exclusivity in the
locking we might end up in some deadlock. That's why patch 1
throttles for the ringspace in order to make sure that no one is
holding it.

Version 1 of this patch has been reviewed by Matt and this
version adds Chris's exclusive locking.

Thanks Chris for this work.

Andi

Changelog
=========
v3 -> v4
 - In v3 the timeline was being locked, but I forgot that
   request_create() and request_add() also lock the timeline:
   the former does the locking, the latter does the unlocking.
   To avoid this extra lock/unlock, we need the "_locked"
   versions of these functions.

v2 -> v3
 - Really lock the timeline before generating all the requests
   until the last.

v1 -> v2
 - Add patch 1 for ensuring exclusive locking of the timeline
 - Reword git commit of patch 2.

Andi Shyti (4):
  drm/i915/gt: Add intel_context_timeline_is_locked helper
  drm/i915: Create the locked version of the request create
  drm/i915: Create the locked version of the request add
  drm/i915/gt: Make sure that errors are propagated through request
    chains

Chris Wilson (1):
  drm/i915: Throttle for ringspace prior to taking the timeline mutex

 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  8 ++++
 drivers/gpu/drm/i915/gt/intel_migrate.c | 41 ++++++++++++++-----
 drivers/gpu/drm/i915/i915_request.c     | 54 ++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_request.h     |  3 ++
 5 files changed, 122 insertions(+), 25 deletions(-)

-- 
2.39.2



* [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
@ 2023-03-08  9:41 ` Andi Shyti
  2023-04-11  8:58   ` Andrzej Hajda
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper Andi Shyti
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

From: Chris Wilson <chris@chris-wilson.co.uk>

Before taking exclusive ownership of the ring for emitting the request,
wait for space in the ring to become available. This allows others to
take the timeline->mutex and make forward progress while userspace is
blocked.

In particular, this allows regular clients to issue requests on the
kernel context, potentially filling the ring, while still allowing the
higher priority heartbeats and pulses to be submitted without being
blocked by the less critical work.
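
For illustration only, the space check that gates the throttle can be
sketched in plain C as below. This is a simplified userspace model, not
the driver code: struct sketch_ring, sketch_ring_space() and
sketch_should_throttle() are hypothetical names, and the 64-byte reserve
is an assumption loosely modelled on keeping head and tail apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reserve so that an empty ring never reports zero space. */
#define SKETCH_RESERVE 64u

/* Hypothetical, simplified ring descriptor; not struct intel_ring. */
struct sketch_ring {
	uint32_t head; /* consumer offset in bytes */
	uint32_t tail; /* producer offset in bytes */
	uint32_t size; /* total bytes, must be a power of two */
};

/* Free bytes from tail up to head, wrapping at size, minus the reserve. */
uint32_t sketch_ring_space(const struct sketch_ring *r)
{
	return (r->head - r->tail - SKETCH_RESERVE) & (r->size - 1);
}

/*
 * Decide, without holding any lock, whether the caller should first wait
 * for older work to retire before trying to take the timeline lock.
 */
bool sketch_should_throttle(const struct sketch_ring *r, uint32_t min_space)
{
	return sketch_ring_space(r) < min_space;
}
```

The point of the ordering is that the potentially long wait happens
before the mutex is taken, so a full ring does not keep other
submitters out of the timeline.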

Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
 drivers/gpu/drm/i915/i915_request.c     |  3 ++
 3 files changed, 46 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2aa63ec521b89..59cd612a23561 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -626,6 +626,47 @@ bool intel_context_revoke(struct intel_context *ce)
 	return ret;
 }
 
+int intel_context_throttle(const struct intel_context *ce)
+{
+	const struct intel_ring *ring = ce->ring;
+	const struct intel_timeline *tl = ce->timeline;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (READ_ONCE(ring->space) >= SZ_1K)
+		return 0;
+
+	rcu_read_lock();
+	list_for_each_entry_reverse(rq, &tl->requests, link) {
+		if (__i915_request_is_complete(rq))
+			break;
+
+		if (rq->ring != ring)
+			continue;
+
+		/* Wait until there will be enough space following that rq */
+		if (__intel_ring_space(rq->postfix,
+				       ring->emit,
+				       ring->size) < ring->size / 2) {
+			if (i915_request_get_rcu(rq)) {
+				rcu_read_unlock();
+
+				if (i915_request_wait(rq,
+						      I915_WAIT_INTERRUPTIBLE,
+						      MAX_SCHEDULE_TIMEOUT) < 0)
+					err = -EINTR;
+
+				rcu_read_lock();
+				i915_request_put(rq);
+			}
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 0a8d553da3f43..f919a66cebf5b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -226,6 +226,8 @@ static inline void intel_context_exit(struct intel_context *ce)
 		ce->ops->exit(ce);
 }
 
+int intel_context_throttle(const struct intel_context *ce);
+
 static inline struct intel_context *intel_context_get(struct intel_context *ce)
 {
 	kref_get(&ce->ref);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 630a732aaecca..72aed544f8714 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1034,6 +1034,9 @@ i915_request_create(struct intel_context *ce)
 	struct i915_request *rq;
 	struct intel_timeline *tl;
 
+	if (intel_context_throttle(ce))
+		return ERR_PTR(-EINTR);
+
 	tl = intel_context_timeline_lock(ce);
 	if (IS_ERR(tl))
 		return ERR_CAST(tl);
-- 
2.39.2



* [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
@ 2023-03-08  9:41 ` Andi Shyti
  2023-04-11  6:30   ` Das, Nirmoy
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create Andi Shyti
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

We have:

 - intel_context_timeline_lock()
 - intel_context_timeline_unlock()

In the next patches we will also need:

 - intel_context_assert_timeline_is_locked()

Add it.

Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/gt/intel_context.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index f919a66cebf5b..87d5e2d60b6db 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -265,6 +265,12 @@ static inline void intel_context_timeline_unlock(struct intel_timeline *tl)
 	mutex_unlock(&tl->mutex);
 }
 
+static inline void intel_context_assert_timeline_is_locked(struct intel_timeline *tl)
+	__must_hold(&tl->mutex)
+{
+	lockdep_assert_held(&tl->mutex);
+}
+
 int intel_context_prepare_remote_request(struct intel_context *ce,
 					 struct i915_request *rq);
 
-- 
2.39.2



* [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper Andi Shyti
@ 2023-03-08  9:41 ` Andi Shyti
  2023-04-11  6:30   ` Das, Nirmoy
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 4/5] drm/i915: Create the locked version of the request add Andi Shyti
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

Make a version of the request creation that does not take the
timeline lock itself and expects the caller to already hold it.

Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_request.h |  2 ++
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 72aed544f8714..5ddb0e02b06b7 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1028,18 +1028,11 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	return ERR_PTR(ret);
 }
 
-struct i915_request *
-i915_request_create(struct intel_context *ce)
+static struct i915_request *
+__i915_request_create_locked(struct intel_context *ce)
 {
 	struct i915_request *rq;
-	struct intel_timeline *tl;
-
-	if (intel_context_throttle(ce))
-		return ERR_PTR(-EINTR);
-
-	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl))
-		return ERR_CAST(tl);
+	struct intel_timeline *tl = ce->timeline;
 
 	/* Move our oldest request to the slab-cache (if not in use!) */
 	rq = list_first_entry(&tl->requests, typeof(*rq), link);
@@ -1049,16 +1042,38 @@ i915_request_create(struct intel_context *ce)
 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_KERNEL);
 	intel_context_exit(ce); /* active reference transferred to request */
-	if (IS_ERR(rq))
-		goto err_unlock;
 
 	/* Check that we do not interrupt ourselves with a new request */
 	rq->cookie = lockdep_pin_lock(&tl->mutex);
 
 	return rq;
+}
+
+struct i915_request *
+i915_request_create_locked(struct intel_context *ce)
+{
+	intel_context_assert_timeline_is_locked(ce->timeline);
+
+	if (intel_context_throttle(ce))
+		return ERR_PTR(-EINTR);
+
+	return __i915_request_create_locked(ce);
+}
+
+struct i915_request *
+i915_request_create(struct intel_context *ce)
+{
+	struct i915_request *rq;
+	struct intel_timeline *tl;
+
+	tl = intel_context_timeline_lock(ce);
+	if (IS_ERR(tl))
+		return ERR_CAST(tl);
+
+	rq = __i915_request_create_locked(ce);
+	if (IS_ERR(rq))
+		intel_context_timeline_unlock(tl);
 
-err_unlock:
-	intel_context_timeline_unlock(tl);
 	return rq;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index f5e1bb5e857aa..bb48bd4605c03 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -374,6 +374,8 @@ struct i915_request * __must_check
 __i915_request_create(struct intel_context *ce, gfp_t gfp);
 struct i915_request * __must_check
 i915_request_create(struct intel_context *ce);
+struct i915_request * __must_check
+i915_request_create_locked(struct intel_context *ce);
 
 void __i915_request_skip(struct i915_request *rq);
 bool i915_request_set_error_once(struct i915_request *rq, int error);
-- 
2.39.2



* [Intel-gfx] [PATCH v4 4/5] drm/i915: Create the locked version of the request add
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
                   ` (2 preceding siblings ...)
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create Andi Shyti
@ 2023-03-08  9:41 ` Andi Shyti
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
  2023-03-08 11:22 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for Fix error propagation amongst request (rev2) Patchwork
  5 siblings, 0 replies; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

i915_request_add() assumes that the timeline is locked when the
function is called and releases the lock before returning. In the
next commit, however, there is one case where the timeline mutex
must stay held, so releasing it is not wanted.

Make a new i915_request_add_locked() version of the function
where the lock is not released.
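
As a rough illustration of the locked/unlocked split (together with the
matching create split in the previous patch), here is a single-threaded
toy model in plain C. Everything prefixed sketch_ is hypothetical, and
the "lock" is only a flag that records ownership, not a real mutex. It
shows the contract: the plain variants hand the lock over from create to
add, while the _locked variants never touch it, so a caller can build
several requests under one lock hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mutex: it only tracks whether it is held. */
struct sketch_lock { bool held; };

void sketch_lock_acquire(struct sketch_lock *l) { assert(!l->held); l->held = true; }
void sketch_lock_release(struct sketch_lock *l) { assert(l->held); l->held = false; }

struct sketch_timeline {
	struct sketch_lock lock;
	int queued;       /* number of requests added so far */
	bool fail_create; /* test knob: make creation fail */
};

struct sketch_request { struct sketch_timeline *tl; };

/* Caller holds the lock; it stays held whether or not creation succeeds. */
struct sketch_request *sketch_create_locked(struct sketch_timeline *tl,
					    struct sketch_request *storage)
{
	assert(tl->lock.held);
	if (tl->fail_create)
		return NULL;
	storage->tl = tl;
	return storage;
}

/* Takes the lock; keeps it on success, drops it on failure. */
struct sketch_request *sketch_create(struct sketch_timeline *tl,
				     struct sketch_request *storage)
{
	struct sketch_request *rq;

	sketch_lock_acquire(&tl->lock);
	rq = sketch_create_locked(tl, storage);
	if (!rq)
		sketch_lock_release(&tl->lock);
	return rq;
}

/* Caller holds the lock; it is still held on return. */
void sketch_add_locked(struct sketch_request *rq)
{
	assert(rq->tl->lock.held);
	rq->tl->queued++;
}

/* Caller holds the lock (taken by sketch_create); released on return. */
void sketch_add(struct sketch_request *rq)
{
	struct sketch_timeline *tl = rq->tl;

	sketch_add_locked(rq);
	sketch_lock_release(&tl->lock);
}

/* Build up to n requests under a single lock hold; returns how many. */
int sketch_build_chain(struct sketch_timeline *tl, int n)
{
	int built = 0;

	sketch_lock_acquire(&tl->lock);
	for (int i = 0; i < n; i++) {
		struct sketch_request storage, *rq;

		rq = sketch_create_locked(tl, &storage);
		if (!rq)
			break;
		sketch_add_locked(rq);
		built++;
	}
	sketch_lock_release(&tl->lock);
	return built;
}
```

In this model nobody else can take the lock between two requests of
the same chain, which is the property the next patch relies on.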

Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_request.c | 14 +++++++++++---
 drivers/gpu/drm/i915/i915_request.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5ddb0e02b06b7..a4af16e25d966 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1852,13 +1852,13 @@ void __i915_request_queue(struct i915_request *rq,
 	local_bh_enable(); /* kick tasklets */
 }
 
-void i915_request_add(struct i915_request *rq)
+void i915_request_add_locked(struct i915_request *rq)
 {
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_gem_context *ctx;
 
-	lockdep_assert_held(&tl->mutex);
+	intel_context_assert_timeline_is_locked(tl);
 	lockdep_unpin_lock(&tl->mutex, rq->cookie);
 
 	trace_i915_request_add(rq);
@@ -1873,7 +1873,15 @@ void i915_request_add(struct i915_request *rq)
 
 	__i915_request_queue(rq, &attr);
 
-	mutex_unlock(&tl->mutex);
+}
+
+void i915_request_add(struct i915_request *rq)
+{
+	struct intel_timeline * const tl = i915_request_timeline(rq);
+
+	i915_request_add_locked(rq);
+
+	intel_context_timeline_unlock(tl);
 }
 
 static unsigned long local_clock_ns(unsigned int *cpu)
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index bb48bd4605c03..29e3a37c300a7 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -425,6 +425,7 @@ int i915_request_await_deps(struct i915_request *rq, const struct i915_deps *dep
 int i915_request_await_execution(struct i915_request *rq,
 				 struct dma_fence *fence);
 
+void i915_request_add_locked(struct i915_request *rq);
 void i915_request_add(struct i915_request *rq);
 
 bool __i915_request_submit(struct i915_request *request);
-- 
2.39.2



* [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
                   ` (3 preceding siblings ...)
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 4/5] drm/i915: Create the locked version of the request add Andi Shyti
@ 2023-03-08  9:41 ` Andi Shyti
  2023-03-10 10:03   ` Matthew Auld
  2023-04-11  6:39   ` Das, Nirmoy
  2023-03-08 11:22 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for Fix error propagation amongst request (rev2) Patchwork
  5 siblings, 2 replies; 16+ messages in thread
From: Andi Shyti @ 2023-03-08  9:41 UTC (permalink / raw)
  To: intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Matthew Auld, Chris Wilson, Maciej Patelczyk

Currently, when we perform operations such as clearing or copying
large blocks of memory, we generate multiple requests that are
executed in a chain.

However, if one of these requests fails, we may not realize it
unless it happens to be the last request in the chain. This is
because errors are not properly propagated.

To address this issue, we need to ensure that the chain of fence
notifications is always propagated so that we can reach the final
fence associated with the last request. By doing so, we will be
able to detect any memory operation failures and determine
whether the memory is still invalid.

On copy and clear migration, signal fences upon request
completion to ensure that we have a reliable propagation of the
operation outcome.
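
To make the symptom concrete, the following plain C sketch models only
the observable outcome, not the fence mechanism used in the patch. The
function names are hypothetical and each array entry stands for the
result of one request in the chain (0 or a negative errno value). The
first function returns what a caller sees when only the last request is
inspected; the second carries the first error forward to the end of the
chain.

```c
#include <errno.h>
#include <stddef.h>

/* Model of the reported bug: only the last request's result is visible. */
int sketch_last_only(const int *results, size_t n)
{
	return n ? results[n - 1] : 0;
}

/* Model of the intended outcome: the first error sticks and reaches the end. */
int sketch_propagated(const int *results, size_t n)
{
	int carried = 0;

	for (size_t i = 0; i < n; i++) {
		if (!carried)
			carried = results[i];
	}
	return carried;
}
```

With results { 0, -EIO, 0 }, the first function reports success while
the second reports -EIO, which is the difference a caller waiting on
the final fence would observe in this model.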

Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
Reported-by: Matthew Auld <matthew.auld@intel.com>
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 41 ++++++++++++++++++-------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..0031e7b1b4704 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 			dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
-		*out = i915_request_get(rq);
-		i915_request_add(rq);
+		}
+		*out = rq;
 
 		if (err)
 			break;
@@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
 		cond_resched();
 	} while (1);
 
-out_ce:
+	mutex_unlock(&ce->timeline->mutex);
+
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
 
@@ -1005,7 +1018,7 @@ intel_context_migrate_clear(struct intel_context *ce,
 		rq = i915_request_create(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -1056,17 +1069,23 @@ intel_context_migrate_clear(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
-			i915_request_put(*out);
-		*out = i915_request_get(rq);
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
 		i915_request_add(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
+			i915_request_put(*out);
+		}
+		*out = rq;
+
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
 
 		cond_resched();
 	} while (1);
 
-out_ce:
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
 
-- 
2.39.2



* [Intel-gfx] ✗ Fi.CI.BAT: failure for Fix error propagation amongst request (rev2)
  2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
                   ` (4 preceding siblings ...)
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
@ 2023-03-08 11:22 ` Patchwork
  5 siblings, 0 replies; 16+ messages in thread
From: Patchwork @ 2023-03-08 11:22 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx


== Series Details ==

Series: Fix error propagation amongst request (rev2)
URL   : https://patchwork.freedesktop.org/series/114451/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12827 -> Patchwork_114451v2
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_114451v2 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_114451v2, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/index.html

Participating hosts (35 -> 36)
------------------------------

  Additional (2): fi-kbl-soraka bat-dg1-6 
  Missing    (1): fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_114451v2:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@guc:
    - bat-dg2-9:          [PASS][1] -> [ABORT][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg2-9/igt@i915_selftest@live@guc.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg2-9/igt@i915_selftest@live@guc.html
    - bat-rpls-1:         NOTRUN -> [ABORT][3]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rpls-1/igt@i915_selftest@live@guc.html
    - bat-dg1-5:          [PASS][4] -> [ABORT][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg1-5/igt@i915_selftest@live@guc.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-5/igt@i915_selftest@live@guc.html
    - bat-dg1-7:          [PASS][6] -> [ABORT][7]
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg1-7/igt@i915_selftest@live@guc.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-7/igt@i915_selftest@live@guc.html
    - bat-adlp-9:         [PASS][8] -> [ABORT][9]
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-adlp-9/igt@i915_selftest@live@guc.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-adlp-9/igt@i915_selftest@live@guc.html
    - bat-dg1-6:          NOTRUN -> [ABORT][10]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@i915_selftest@live@guc.html
    - bat-dg2-8:          [PASS][11] -> [ABORT][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg2-8/igt@i915_selftest@live@guc.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg2-8/igt@i915_selftest@live@guc.html
    - bat-adlm-1:         [PASS][13] -> [ABORT][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-adlm-1/igt@i915_selftest@live@guc.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-adlm-1/igt@i915_selftest@live@guc.html

  
Known issues
------------

  Here are the changes found in Patchwork_114451v2 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_huc_copy@huc-copy:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][15] ([fdo#109271] / [i915#2190])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@basic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][16] ([fdo#109271] / [i915#4613]) +3 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@gem_lmem_swapping@basic.html

  * igt@gem_mmap@basic:
    - bat-dg1-6:          NOTRUN -> [SKIP][17] ([i915#4083])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@gem_mmap@basic.html

  * igt@gem_render_tiled_blits@basic:
    - bat-dg1-6:          NOTRUN -> [SKIP][18] ([i915#4079]) +1 similar issue
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@gem_render_tiled_blits@basic.html

  * igt@gem_tiled_fence_blits@basic:
    - bat-dg1-6:          NOTRUN -> [SKIP][19] ([i915#4077]) +2 similar issues
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@gem_tiled_fence_blits@basic.html

  * igt@i915_pm_backlight@basic-brightness:
    - bat-dg1-6:          NOTRUN -> [SKIP][20] ([i915#7561])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@i915_pm_backlight@basic-brightness.html

  * igt@i915_pm_rpm@basic-rte:
    - bat-adln-1:         [PASS][21] -> [ABORT][22] ([i915#7977])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-adln-1/igt@i915_pm_rpm@basic-rte.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-adln-1/igt@i915_pm_rpm@basic-rte.html

  * igt@i915_pm_rps@basic-api:
    - bat-dg1-6:          NOTRUN -> [SKIP][23] ([i915#6621])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@i915_pm_rps@basic-api.html

  * igt@i915_selftest@live@execlists:
    - fi-bsw-n3050:       [PASS][24] -> [ABORT][25] ([i915#7911])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/fi-bsw-n3050/igt@i915_selftest@live@execlists.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-bsw-n3050/igt@i915_selftest@live@execlists.html

  * igt@i915_selftest@live@gem_contexts:
    - fi-kbl-soraka:      NOTRUN -> [INCOMPLETE][26] ([i915#7913])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@i915_selftest@live@gem_contexts.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][27] ([i915#5334] / [i915#7872])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][28] ([i915#1886])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@i915_selftest@live@guc:
    - bat-rpls-2:         [PASS][29] -> [ABORT][30] ([i915#7913])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-rpls-2/igt@i915_selftest@live@guc.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rpls-2/igt@i915_selftest@live@guc.html
    - bat-atsm-1:         [PASS][31] -> [ABORT][32] ([i915#7913])
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-atsm-1/igt@i915_selftest@live@guc.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-atsm-1/igt@i915_selftest@live@guc.html
    - bat-dg2-11:         [PASS][33] -> [ABORT][34] ([i915#7913])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg2-11/igt@i915_selftest@live@guc.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg2-11/igt@i915_selftest@live@guc.html
    - bat-rplp-1:         [PASS][35] -> [ABORT][36] ([i915#7913])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-rplp-1/igt@i915_selftest@live@guc.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rplp-1/igt@i915_selftest@live@guc.html

  * igt@i915_selftest@live@migrate:
    - bat-dg2-11:         [PASS][37] -> [DMESG-WARN][38] ([i915#7699])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-dg2-11/igt@i915_selftest@live@migrate.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg2-11/igt@i915_selftest@live@migrate.html
    - bat-atsm-1:         [PASS][39] -> [DMESG-FAIL][40] ([i915#7699])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-atsm-1/igt@i915_selftest@live@migrate.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-atsm-1/igt@i915_selftest@live@migrate.html

  * igt@i915_selftest@live@slpc:
    - bat-rpls-1:         NOTRUN -> [DMESG-FAIL][41] ([i915#6367])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rpls-1/igt@i915_selftest@live@slpc.html

  * igt@kms_addfb_basic@basic-y-tiled-legacy:
    - bat-dg1-6:          NOTRUN -> [SKIP][42] ([i915#4215])
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_addfb_basic@basic-y-tiled-legacy.html

  * igt@kms_addfb_basic@tile-pitch-mismatch:
    - bat-dg1-6:          NOTRUN -> [SKIP][43] ([i915#4212]) +7 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_addfb_basic@tile-pitch-mismatch.html

  * igt@kms_chamelium_edid@hdmi-edid-read:
    - bat-dg1-6:          NOTRUN -> [SKIP][44] ([i915#7828]) +7 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_chamelium_edid@hdmi-edid-read.html

  * igt@kms_chamelium_frames@hdmi-crc-fast:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][45] ([fdo#109271]) +16 similar issues
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-kbl-soraka/igt@kms_chamelium_frames@hdmi-crc-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - bat-dg1-6:          NOTRUN -> [SKIP][46] ([i915#4103] / [i915#4213]) +1 similar issue
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_force_connector_basic@force-load-detect:
    - bat-dg1-6:          NOTRUN -> [SKIP][47] ([fdo#109285])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_psr@sprite_plane_onoff:
    - bat-dg1-6:          NOTRUN -> [SKIP][48] ([i915#1072] / [i915#4078]) +3 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_psr@sprite_plane_onoff.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - bat-dg1-6:          NOTRUN -> [SKIP][49] ([i915#3555])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@kms_setmode@basic-clone-single-crtc.html

  * igt@prime_vgem@basic-gtt:
    - bat-dg1-6:          NOTRUN -> [SKIP][50] ([i915#3708] / [i915#4077]) +1 similar issue
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@prime_vgem@basic-gtt.html

  * igt@prime_vgem@basic-read:
    - bat-dg1-6:          NOTRUN -> [SKIP][51] ([i915#3708]) +3 similar issues
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@prime_vgem@basic-read.html

  * igt@prime_vgem@basic-userptr:
    - bat-dg1-6:          NOTRUN -> [SKIP][52] ([i915#3708] / [i915#4873])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-dg1-6/igt@prime_vgem@basic-userptr.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@hangcheck:
    - fi-skl-guc:         [DMESG-WARN][53] ([i915#8073]) -> [PASS][54]
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/fi-skl-guc/igt@i915_selftest@live@hangcheck.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/fi-skl-guc/igt@i915_selftest@live@hangcheck.html

  * igt@i915_selftest@live@requests:
    - bat-rpls-1:         [ABORT][55] ([i915#4983] / [i915#7694] / [i915#7911] / [i915#7981]) -> [PASS][56]
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-rpls-1/igt@i915_selftest@live@requests.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rpls-1/igt@i915_selftest@live@requests.html

  
#### Warnings ####

  * igt@i915_selftest@live@slpc:
    - bat-rpls-2:         [DMESG-FAIL][57] ([i915#6367] / [i915#7913]) -> [DMESG-FAIL][58] ([i915#6997] / [i915#7913])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12827/bat-rpls-2/igt@i915_selftest@live@slpc.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/bat-rpls-2/igt@i915_selftest@live@slpc.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4078]: https://gitlab.freedesktop.org/drm/intel/issues/4078
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
  [i915#4215]: https://gitlab.freedesktop.org/drm/intel/issues/4215
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
  [i915#5334]: https://gitlab.freedesktop.org/drm/intel/issues/5334
  [i915#6367]: https://gitlab.freedesktop.org/drm/intel/issues/6367
  [i915#6621]: https://gitlab.freedesktop.org/drm/intel/issues/6621
  [i915#6997]: https://gitlab.freedesktop.org/drm/intel/issues/6997
  [i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561
  [i915#7694]: https://gitlab.freedesktop.org/drm/intel/issues/7694
  [i915#7699]: https://gitlab.freedesktop.org/drm/intel/issues/7699
  [i915#7828]: https://gitlab.freedesktop.org/drm/intel/issues/7828
  [i915#7872]: https://gitlab.freedesktop.org/drm/intel/issues/7872
  [i915#7911]: https://gitlab.freedesktop.org/drm/intel/issues/7911
  [i915#7913]: https://gitlab.freedesktop.org/drm/intel/issues/7913
  [i915#7977]: https://gitlab.freedesktop.org/drm/intel/issues/7977
  [i915#7981]: https://gitlab.freedesktop.org/drm/intel/issues/7981
  [i915#8073]: https://gitlab.freedesktop.org/drm/intel/issues/8073


Build changes
-------------

  * Linux: CI_DRM_12827 -> Patchwork_114451v2

  CI-20190529: 20190529
  CI_DRM_12827: b794b8d84dc0470ee58467386f41870e81a86580 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7183: 3434cef8be4e487644a740039ad15123cd094526 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_114451v2: b794b8d84dc0470ee58467386f41870e81a86580 @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

43496e7d0218 drm/i915/gt: Make sure that errors are propagated through request chains
deee3777ea9c drm/i915: Create the locked version of the request add
cf2760c79b8f drm/i915: Create the locked version of the request create
7367b6ee10a2 drm/i915/gt: Add intel_context_timeline_is_locked helper
c08f790487bc drm/i915: Throttle for ringspace prior to taking the timeline mutex

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v2/index.html



* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
@ 2023-03-10 10:03   ` Matthew Auld
  2023-04-11  6:39   ` Das, Nirmoy
  1 sibling, 0 replies; 16+ messages in thread
From: Matthew Auld @ 2023-03-10 10:03 UTC (permalink / raw)
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Chris Wilson, Maciej Patelczyk

On 08/03/2023 09:41, Andi Shyti wrote:
> Currently, when we perform operations such as clearing or copying
> large blocks of memory, we generate multiple requests that are
> executed in a chain.
> 
> However, if one of these requests fails, we may not realize it
> unless it happens to be the last request in the chain. This is
> because errors are not properly propagated.
> 
> For this we need to keep propagating the chain of fence
> notification in order to always reach the final fence associated
> to the final request.
> 
> To address this issue, we need to ensure that the chain of fence
> notifications is always propagated so that we can reach the final
> fence associated with the last request. By doing so, we will be
> able to detect any memory operation  failures and determine
> whether the memory is still invalid.
> 
> On copy and clear migration signal fences upon completion.
> 
> On copy and clear migration, signal fences upon request
> completion to ensure that we have a reliable perpetuation of the
> operation outcome.
> 
> Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> Reported-by: Matthew Auld <matthew.auld@intel.com>
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: stable@vger.kernel.org
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_migrate.c | 41 ++++++++++++++++++-------
>   1 file changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> index 3f638f1987968..0031e7b1b4704 100644
> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
>   			dst_offset = 2 * CHUNK_SZ;
>   	}
>   
> +	/*
> +	 * While building the chain of requests, we need to ensure
> +	 * that no one can sneak into the timeline unnoticed.
> +	 */
> +	mutex_lock(&ce->timeline->mutex);
> +

Hmm, this looks different/new from the previous version. Why do we only 
do this for the copy and not the clear btw? Both should be conceptually 
the same. Sorry if I'm misunderstanding something here.

>   	do {
>   		int len;
>   
> -		rq = i915_request_create(ce);
> +		rq = i915_request_create_locked(ce);
>   		if (IS_ERR(rq)) {
>   			err = PTR_ERR(rq);
> -			goto out_ce;
> +			break;
>   		}
>   
>   		if (deps) {
> @@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
>   
>   		/* Arbitration is re-enabled between requests. */
>   out_rq:
> -		if (*out)
> +		i915_sw_fence_await(&rq->submit);
> +		i915_request_get(rq);
> +		i915_request_add_locked(rq);
> +		if (*out) {
> +			i915_sw_fence_complete(&(*out)->submit);
>   			i915_request_put(*out);
> -		*out = i915_request_get(rq);
> -		i915_request_add(rq);
> +		}
> +		*out = rq;
>   
>   		if (err)
>   			break;
> @@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
>   		cond_resched();
>   	} while (1);
>   
> -out_ce:
> +	mutex_unlock(&ce->timeline->mutex);
> +
> +	if (*out)
> +		i915_sw_fence_complete(&(*out)->submit);
>   	return err;
>   }
>   
> @@ -1005,7 +1018,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>   		rq = i915_request_create(ce);
>   		if (IS_ERR(rq)) {
>   			err = PTR_ERR(rq);
> -			goto out_ce;
> +			break;
>   		}
>   
>   		if (deps) {
> @@ -1056,17 +1069,23 @@ intel_context_migrate_clear(struct intel_context *ce,
>   
>   		/* Arbitration is re-enabled between requests. */
>   out_rq:
> -		if (*out)
> -			i915_request_put(*out);
> -		*out = i915_request_get(rq);
> +		i915_sw_fence_await(&rq->submit);
> +		i915_request_get(rq);
>   		i915_request_add(rq);
> +		if (*out) {
> +			i915_sw_fence_complete(&(*out)->submit);
> +			i915_request_put(*out);
> +		}
> +		*out = rq;
> +
>   		if (err || !it.sg || !sg_dma_len(it.sg))
>   			break;
>   
>   		cond_resched();
>   	} while (1);
>   
> -out_ce:
> +	if (*out)
> +		i915_sw_fence_complete(&(*out)->submit);
>   	return err;
>   }
>   
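
To make the problem in the quoted commit message concrete, here is a minimal userspace sketch of a request chain in which only the last request's status is inspected, next to one in which an earlier failure is carried through to the end. It is a toy model under stated assumptions, not the i915 implementation: the `toy_request` struct and the `toy_*` helpers are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a request; this is NOT struct i915_request. */
struct toy_request {
	int error;                 /* 0 on success, negative errno on failure */
	struct toy_request *prev;  /* request this one was chained after */
};

/* Link rq behind prev in the chain. */
static void toy_chain(struct toy_request *rq, struct toy_request *prev)
{
	rq->prev = prev;
}

/* Without propagation: the caller only sees the status of the last request. */
static int toy_status_last_only(const struct toy_request *last)
{
	assert(last != NULL);
	return last->error;
}

/*
 * With propagation: walk back through the chain and report an error from
 * any request. Because the walk goes from newest to oldest and overwrites
 * err each time, the oldest failure is the one reported.
 */
static int toy_status_propagated(const struct toy_request *last)
{
	const struct toy_request *rq;
	int err = 0;

	assert(last != NULL);
	for (rq = last; rq; rq = rq->prev)
		if (rq->error)
			err = rq->error;
	return err;
}
```

With a three-request chain whose middle request fails with -EIO, `toy_status_last_only()` on the final request returns 0 while `toy_status_propagated()` returns -EIO, which is the gap the patch is meant to close.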


* Re: [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper Andi Shyti
@ 2023-04-11  6:30   ` Das, Nirmoy
  0 siblings, 0 replies; 16+ messages in thread
From: Das, Nirmoy @ 2023-04-11  6:30 UTC (permalink / raw)
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Maciej Patelczyk, Matthew Auld, Chris Wilson, Andi Shyti


On 3/8/2023 10:41 AM, Andi Shyti wrote:
> We have:
>
>   - intel_context_timeline_lock()
>   - intel_context_timeline_unlock()
>
> In the next patches we will also need:
>
>   - intel_context_timeline_is_locked()
>
> Add it.
>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>


> ---
>   drivers/gpu/drm/i915/gt/intel_context.h | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index f919a66cebf5b..87d5e2d60b6db 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -265,6 +265,12 @@ static inline void intel_context_timeline_unlock(struct intel_timeline *tl)
>   	mutex_unlock(&tl->mutex);
>   }
>   
> +static inline void intel_context_assert_timeline_is_locked(struct intel_timeline *tl)
> +	__must_hold(&tl->mutex)
> +{
> +	lockdep_assert_held(&tl->mutex);
> +}
> +
>   int intel_context_prepare_remote_request(struct intel_context *ce,
>   					 struct i915_request *rq);
>   
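
Note that the commit message names intel_context_timeline_is_locked() while the diff adds intel_context_assert_timeline_is_locked(), built on lockdep_assert_held(). The difference between a query helper and an assert helper can be sketched in userspace as follows. This is only an illustration: `toy_timeline` uses a plain flag instead of a mutex, is single-threaded, and every `toy_*` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy timeline: a flag stands in for the mutex (single-threaded only). */
struct toy_timeline {
	bool locked;
};

static void toy_timeline_lock(struct toy_timeline *tl)
{
	assert(!tl->locked);  /* a real mutex would block here instead */
	tl->locked = true;
}

static void toy_timeline_unlock(struct toy_timeline *tl)
{
	assert(tl->locked);
	tl->locked = false;
}

/* Query form: reports the state and lets the caller decide what to do. */
static bool toy_timeline_is_locked(const struct toy_timeline *tl)
{
	return tl->locked;
}

/* Assert form: aborts if the caller broke the locking contract. */
static void toy_timeline_assert_locked(const struct toy_timeline *tl)
{
	assert(toy_timeline_is_locked(tl));
}
```

The assert form documents a precondition at the top of a "_locked" function; it returns nothing, so it cannot be used to branch on the lock state.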


* Re: [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create Andi Shyti
@ 2023-04-11  6:30   ` Das, Nirmoy
  0 siblings, 0 replies; 16+ messages in thread
From: Das, Nirmoy @ 2023-04-11  6:30 UTC (permalink / raw)
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Chris Wilson, Matthew Auld, Maciej Patelczyk


On 3/8/2023 10:41 AM, Andi Shyti wrote:
> Make a version of the request creation that doesn't take any
> lock.
>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>

> ---
>   drivers/gpu/drm/i915/i915_request.c | 43 +++++++++++++++++++----------
>   drivers/gpu/drm/i915/i915_request.h |  2 ++
>   2 files changed, 31 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 72aed544f8714..5ddb0e02b06b7 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1028,18 +1028,11 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
>   	return ERR_PTR(ret);
>   }
>   
> -struct i915_request *
> -i915_request_create(struct intel_context *ce)
> +static struct i915_request *
> +__i915_request_create_locked(struct intel_context *ce)
>   {
>   	struct i915_request *rq;
> -	struct intel_timeline *tl;
> -
> -	if (intel_context_throttle(ce))
> -		return ERR_PTR(-EINTR);
> -
> -	tl = intel_context_timeline_lock(ce);
> -	if (IS_ERR(tl))
> -		return ERR_CAST(tl);
> +	struct intel_timeline *tl = ce->timeline;
>   
>   	/* Move our oldest request to the slab-cache (if not in use!) */
>   	rq = list_first_entry(&tl->requests, typeof(*rq), link);
> @@ -1049,16 +1042,38 @@ i915_request_create(struct intel_context *ce)
>   	intel_context_enter(ce);
>   	rq = __i915_request_create(ce, GFP_KERNEL);
>   	intel_context_exit(ce); /* active reference transferred to request */
> -	if (IS_ERR(rq))
> -		goto err_unlock;
>   
>   	/* Check that we do not interrupt ourselves with a new request */
>   	rq->cookie = lockdep_pin_lock(&tl->mutex);
>   
>   	return rq;
> +}
> +
> +struct i915_request *
> +i915_request_create_locked(struct intel_context *ce)
> +{
> +	intel_context_assert_timeline_is_locked(ce->timeline);
> +
> +	if (intel_context_throttle(ce))
> +		return ERR_PTR(-EINTR);
> +
> +	return __i915_request_create_locked(ce);
> +}
> +
> +struct i915_request *
> +i915_request_create(struct intel_context *ce)
> +{
> +	struct i915_request *rq;
> +	struct intel_timeline *tl;
> +
> +	tl = intel_context_timeline_lock(ce);
> +	if (IS_ERR(tl))
> +		return ERR_CAST(tl);
> +
> +	rq = __i915_request_create_locked(ce);
> +	if (IS_ERR(rq))
> +		intel_context_timeline_unlock(tl);
>   
> -err_unlock:
> -	intel_context_timeline_unlock(tl);
>   	return rq;
>   }
>   
> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
> index f5e1bb5e857aa..bb48bd4605c03 100644
> --- a/drivers/gpu/drm/i915/i915_request.h
> +++ b/drivers/gpu/drm/i915/i915_request.h
> @@ -374,6 +374,8 @@ struct i915_request * __must_check
>   __i915_request_create(struct intel_context *ce, gfp_t gfp);
>   struct i915_request * __must_check
>   i915_request_create(struct intel_context *ce);
> +struct i915_request * __must_check
> +i915_request_create_locked(struct intel_context *ce);
>   
>   void __i915_request_skip(struct i915_request *rq);
>   bool i915_request_set_error_once(struct i915_request *rq, int error);
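
The split between a locking wrapper and a "_locked" core, as described in the cover letter (create takes the timeline lock, add releases it), can be sketched in userspace like this. It is a simplified model, not the driver code: the flag-based lock, `toy_ctx`, `toy_rq` and all `toy_*` functions are hypothetical, and error handling is reduced to a NULL return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_ctx {
	bool locked;      /* stands in for the timeline mutex */
	int nr_requests;  /* requests created on this context */
};

struct toy_rq {
	struct toy_ctx *ctx;
};

/* Core: the caller must already hold the lock; the lock is never touched. */
static struct toy_rq *toy_request_create_locked(struct toy_ctx *ctx)
{
	struct toy_rq *rq;

	assert(ctx->locked);
	rq = malloc(sizeof(*rq));
	if (!rq)
		return NULL;
	rq->ctx = ctx;
	ctx->nr_requests++;
	return rq;
}

/* Wrapper: takes the lock and keeps it on success; drops it on failure. */
static struct toy_rq *toy_request_create(struct toy_ctx *ctx)
{
	struct toy_rq *rq;

	assert(!ctx->locked);
	ctx->locked = true;
	rq = toy_request_create_locked(ctx);
	if (!rq)
		ctx->locked = false;
	return rq;
}

/* Core: consumes the request and leaves the lock held. */
static void toy_request_add_locked(struct toy_rq *rq)
{
	assert(rq->ctx->locked);
	free(rq);
}

/* Wrapper: consumes the request and releases the lock taken by create. */
static void toy_request_add(struct toy_rq *rq)
{
	struct toy_ctx *ctx = rq->ctx;

	toy_request_add_locked(rq);
	ctx->locked = false;
}
```

A caller that builds a whole chain takes the lock once, loops over the `_locked` pair, and unlocks at the end; a caller that needs a single request keeps using the plain pair.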


* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
  2023-03-10 10:03   ` Matthew Auld
@ 2023-04-11  6:39   ` Das, Nirmoy
  2023-04-11 14:35     ` Rodrigo Vivi
  1 sibling, 1 reply; 16+ messages in thread
From: Das, Nirmoy @ 2023-04-11  6:39 UTC (permalink / raw)
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Andi Shyti, Chris Wilson, Matthew Auld, Maciej Patelczyk


On 3/8/2023 10:41 AM, Andi Shyti wrote:
> Currently, when we perform operations such as clearing or copying
> large blocks of memory, we generate multiple requests that are
> executed in a chain.
>
> However, if one of these requests fails, we may not realize it
> unless it happens to be the last request in the chain. This is
> because errors are not properly propagated.
>
> For this we need to keep propagating the chain of fence
> notification in order to always reach the final fence associated
> to the final request.
>
> To address this issue, we need to ensure that the chain of fence
> notifications is always propagated so that we can reach the final
> fence associated with the last request. By doing so, we will be
> able to detect any memory operation  failures and determine
> whether the memory is still invalid.
>
> On copy and clear migration signal fences upon completion.
>
> On copy and clear migration, signal fences upon request
> completion to ensure that we have a reliable perpetuation of the
> operation outcome.
>
> Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> Reported-by: Matthew Auld <matthew.auld@intel.com>
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: stable@vger.kernel.org
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
With Matt's comment regarding the missing lock in
intel_context_migrate_clear addressed, this is:

Acked-by: Nirmoy Das <nirmoy.das@intel.com>

> ---
>   drivers/gpu/drm/i915/gt/intel_migrate.c | 41 ++++++++++++++++++-------
>   1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> index 3f638f1987968..0031e7b1b4704 100644
> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
>   			dst_offset = 2 * CHUNK_SZ;
>   	}
>   
> +	/*
> +	 * While building the chain of requests, we need to ensure
> +	 * that no one can sneak into the timeline unnoticed.
> +	 */
> +	mutex_lock(&ce->timeline->mutex);
> +
>   	do {
>   		int len;
>   
> -		rq = i915_request_create(ce);
> +		rq = i915_request_create_locked(ce);
>   		if (IS_ERR(rq)) {
>   			err = PTR_ERR(rq);
> -			goto out_ce;
> +			break;
>   		}
>   
>   		if (deps) {
> @@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
>   
>   		/* Arbitration is re-enabled between requests. */
>   out_rq:
> -		if (*out)
> +		i915_sw_fence_await(&rq->submit);
> +		i915_request_get(rq);
> +		i915_request_add_locked(rq);
> +		if (*out) {
> +			i915_sw_fence_complete(&(*out)->submit);
>   			i915_request_put(*out);
> -		*out = i915_request_get(rq);
> -		i915_request_add(rq);
> +		}
> +		*out = rq;
>   
>   		if (err)
>   			break;
> @@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
>   		cond_resched();
>   	} while (1);
>   
> -out_ce:
> +	mutex_unlock(&ce->timeline->mutex);
> +
> +	if (*out)
> +		i915_sw_fence_complete(&(*out)->submit);
>   	return err;
>   }
>   
> @@ -1005,7 +1018,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>   		rq = i915_request_create(ce);
>   		if (IS_ERR(rq)) {
>   			err = PTR_ERR(rq);
> -			goto out_ce;
> +			break;
>   		}
>   
>   		if (deps) {
> @@ -1056,17 +1069,23 @@ intel_context_migrate_clear(struct intel_context *ce,
>   
>   		/* Arbitration is re-enabled between requests. */
>   out_rq:
> -		if (*out)
> -			i915_request_put(*out);
> -		*out = i915_request_get(rq);
> +		i915_sw_fence_await(&rq->submit);
> +		i915_request_get(rq);
>   		i915_request_add(rq);
> +		if (*out) {
> +			i915_sw_fence_complete(&(*out)->submit);
> +			i915_request_put(*out);
> +		}
> +		*out = rq;
> +
>   		if (err || !it.sg || !sg_dma_len(it.sg))
>   			break;
>   
>   		cond_resched();
>   	} while (1);
>   
> -out_ce:
> +	if (*out)
> +		i915_sw_fence_complete(&(*out)->submit);
>   	return err;
>   }
>   
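
The await/complete pairing in the out_rq hunks above can be pictured as a counting fence: an extra hold is taken on the new request's submit fence before it is added, and the hold on the previous request is released only after the new one is queued. The following is a rough userspace sketch of that ordering. It assumes a simple counter model; `toy_fence` and its helpers are hypothetical and much simpler than the real i915_sw_fence.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counting fence: signals once every hold has been dropped. */
struct toy_fence {
	int pending;    /* outstanding holds */
	bool signaled;  /* set when pending reaches zero */
};

static void toy_fence_init(struct toy_fence *f)
{
	f->pending = 1;  /* the creator's own hold */
	f->signaled = false;
}

/* Take an extra hold so the fence cannot signal yet. */
static void toy_fence_await(struct toy_fence *f)
{
	assert(!f->signaled);
	f->pending++;
}

/* Drop one hold; the fence signals when the last hold goes away. */
static void toy_fence_complete(struct toy_fence *f)
{
	assert(f->pending > 0);
	if (--f->pending == 0)
		f->signaled = true;
}
```

In this model, the chain builder awaits on request N, drops the creator's hold when N is added, and completes N only after request N+1 has been awaited and added, so N can never signal before its successor is in place.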


* Re: [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex
  2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
@ 2023-04-11  8:58   ` Andrzej Hajda
  0 siblings, 0 replies; 16+ messages in thread
From: Andrzej Hajda @ 2023-04-11  8:58 UTC (permalink / raw)
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Maciej Patelczyk, Chris Wilson, Andi Shyti, Matthew Auld

On 08.03.2023 10:41, Andi Shyti wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Before taking exclusive ownership of the ring for emitting the request,
> wait for space in the ring to become available. This allows others to
> take the timeline->mutex to make forward progress while userspace is
> blocked.
> 
> In particular, this allows regular clients to issue requests on the
> kernel context, potentially filling the ring, but allows the higher
> priority heartbeats and pulses to still be submitted without being
> blocked by the less critical work.
> 
> Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>

Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>

Regards
Andrzej
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
>   drivers/gpu/drm/i915/i915_request.c     |  3 ++
>   3 files changed, 46 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index 2aa63ec521b89..59cd612a23561 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -626,6 +626,47 @@ bool intel_context_revoke(struct intel_context *ce)
>   	return ret;
>   }
>   
> +int intel_context_throttle(const struct intel_context *ce)
> +{
> +	const struct intel_ring *ring = ce->ring;
> +	const struct intel_timeline *tl = ce->timeline;
> +	struct i915_request *rq;
> +	int err = 0;
> +
> +	if (READ_ONCE(ring->space) >= SZ_1K)
> +		return 0;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_reverse(rq, &tl->requests, link) {
> +		if (__i915_request_is_complete(rq))
> +			break;
> +
> +		if (rq->ring != ring)
> +			continue;
> +
> +		/* Wait until there will be enough space following that rq */
> +		if (__intel_ring_space(rq->postfix,
> +				       ring->emit,
> +				       ring->size) < ring->size / 2) {
> +			if (i915_request_get_rcu(rq)) {
> +				rcu_read_unlock();
> +
> +				if (i915_request_wait(rq,
> +						      I915_WAIT_INTERRUPTIBLE,
> +						      MAX_SCHEDULE_TIMEOUT) < 0)
> +					err = -EINTR;
> +
> +				rcu_read_lock();
> +				i915_request_put(rq);
> +			}
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_context.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index 0a8d553da3f43..f919a66cebf5b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -226,6 +226,8 @@ static inline void intel_context_exit(struct intel_context *ce)
>   		ce->ops->exit(ce);
>   }
>   
> +int intel_context_throttle(const struct intel_context *ce);
> +
>   static inline struct intel_context *intel_context_get(struct intel_context *ce)
>   {
>   	kref_get(&ce->ref);
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 630a732aaecca..72aed544f8714 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1034,6 +1034,9 @@ i915_request_create(struct intel_context *ce)
>   	struct i915_request *rq;
>   	struct intel_timeline *tl;
>   
> +	if (intel_context_throttle(ce))
> +		return ERR_PTR(-EINTR);
> +
>   	tl = intel_context_timeline_lock(ce);
>   	if (IS_ERR(tl))
>   		return ERR_CAST(tl);
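
The throttle check in the diff above compares the space that would follow a given request against half the ring size. A small userspace sketch of that arithmetic on a power-of-two ring is shown below. The names `toy_ring_space` and `toy_should_wait` are hypothetical, and the 64-byte reserve between tail and head is an assumption made for illustration rather than a value taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RESERVE 64u  /* assumed gap kept between tail and head, in bytes */

/*
 * Free bytes between tail (where new commands are written) and head
 * (where the consumer has got to) in a ring whose size is a power of two.
 * Unsigned wrap-around plus the mask handles the tail-ahead-of-head case.
 */
static unsigned int toy_ring_space(unsigned int head, unsigned int tail,
				   unsigned int size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (head - tail - TOY_RESERVE) & (size - 1);
}

/* Mirror of the check in the diff: wait if less than half the ring is free. */
static bool toy_should_wait(unsigned int head, unsigned int tail,
			    unsigned int size)
{
	return toy_ring_space(head, tail, size) < size / 2;
}
```

For a 4096-byte ring, an empty ring (head and tail both 0) leaves 4032 bytes, so no wait is needed; with head at 1024 and tail at 3072 only 1984 bytes remain, which is below half the ring and would trigger the wait.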



* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-04-11  6:39   ` Das, Nirmoy
@ 2023-04-11 14:35     ` Rodrigo Vivi
  2023-04-12 10:56       ` Andi Shyti
  0 siblings, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2023-04-11 14:35 UTC (permalink / raw)
  To: Das, Nirmoy, Joonas Lahtinen, Tvrtko Ursulin
  Cc: Andi Shyti, intel-gfx, stable, Maciej Patelczyk, dri-devel,
	Chris Wilson, Matthew Auld

On Tue, Apr 11, 2023 at 08:39:00AM +0200, Das, Nirmoy wrote:
> 
> On 3/8/2023 10:41 AM, Andi Shyti wrote:
> > Currently, when we perform operations such as clearing or copying
> > large blocks of memory, we generate multiple requests that are
> > executed in a chain.
> > 
> > However, if one of these requests fails, we may not realize it
> > unless it happens to be the last request in the chain. This is
> > because errors are not properly propagated.
> > 
> > For this we need to keep propagating the chain of fence
> > notification in order to always reach the final fence associated
> > to the final request.
> > 
> > To address this issue, we need to ensure that the chain of fence
> > notifications is always propagated so that we can reach the final
> > fence associated with the last request. By doing so, we will be
> > able to detect any memory operation  failures and determine
> > whether the memory is still invalid.
> > 
> > On copy and clear migration signal fences upon completion.
> > 
> > On copy and clear migration, signal fences upon request
> > completion to ensure that we have a reliable perpetuation of the
> > operation outcome.
> > 
> > Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> > Reported-by: Matthew Auld <matthew.auld@intel.com>
> > Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > Cc: stable@vger.kernel.org
> > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> With  Matt's comment regarding missing lock in intel_context_migrate_clear
> addressed, this is:
> 
> Acked-by: Nirmoy Das <nirmoy.das@intel.com>

Nack!

Please get some ack from Joonas or Tvrtko before merging this series.

It is a big series targeting stable o.O and the revisions in the cover
letter do not make me confident that this is the right approach, as
opposed to simply reverting the original offending commit:

cf586021642d ("drm/i915/gt: Pipelined page migration")

It looks to me that we are adding magic on top of magic to work around
the deadlocks, but then adding more waits inside locks... And with
the hang checks vs. heartbeats, is this really an issue in current upstream
code, or was it only on DII?

Where was the bug report to start with?

> 
> > ---
> >   drivers/gpu/drm/i915/gt/intel_migrate.c | 41 ++++++++++++++++++-------
> >   1 file changed, 30 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > index 3f638f1987968..0031e7b1b4704 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > @@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
> >   			dst_offset = 2 * CHUNK_SZ;
> >   	}
> > +	/*
> > +	 * While building the chain of requests, we need to ensure
> > +	 * that no one can sneak into the timeline unnoticed.
> > +	 */
> > +	mutex_lock(&ce->timeline->mutex);
> > +
> >   	do {
> >   		int len;
> > -		rq = i915_request_create(ce);
> > +		rq = i915_request_create_locked(ce);
> >   		if (IS_ERR(rq)) {
> >   			err = PTR_ERR(rq);
> > -			goto out_ce;
> > +			break;
> >   		}
> >   		if (deps) {
> > @@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
> >   		/* Arbitration is re-enabled between requests. */
> >   out_rq:
> > -		if (*out)
> > +		i915_sw_fence_await(&rq->submit);
> > +		i915_request_get(rq);
> > +		i915_request_add_locked(rq);
> > +		if (*out) {
> > +			i915_sw_fence_complete(&(*out)->submit);
> >   			i915_request_put(*out);
> > -		*out = i915_request_get(rq);
> > -		i915_request_add(rq);
> > +		}
> > +		*out = rq;
> >   		if (err)
> >   			break;
> > @@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
> >   		cond_resched();
> >   	} while (1);
> > -out_ce:
> > +	mutex_unlock(&ce->timeline->mutex);
> > +
> > +	if (*out)
> > +		i915_sw_fence_complete(&(*out)->submit);
> >   	return err;
> >   }
> > @@ -1005,7 +1018,7 @@ intel_context_migrate_clear(struct intel_context *ce,
> >   		rq = i915_request_create(ce);
> >   		if (IS_ERR(rq)) {
> >   			err = PTR_ERR(rq);
> > -			goto out_ce;
> > +			break;
> >   		}
> >   		if (deps) {
> > @@ -1056,17 +1069,23 @@ intel_context_migrate_clear(struct intel_context *ce,
> >   		/* Arbitration is re-enabled between requests. */
> >   out_rq:
> > -		if (*out)
> > -			i915_request_put(*out);
> > -		*out = i915_request_get(rq);
> > +		i915_sw_fence_await(&rq->submit);
> > +		i915_request_get(rq);
> >   		i915_request_add(rq);
> > +		if (*out) {
> > +			i915_sw_fence_complete(&(*out)->submit);
> > +			i915_request_put(*out);
> > +		}
> > +		*out = rq;
> > +
> >   		if (err || !it.sg || !sg_dma_len(it.sg))
> >   			break;
> >   		cond_resched();
> >   	} while (1);
> > -out_ce:
> > +	if (*out)
> > +		i915_sw_fence_complete(&(*out)->submit);
> >   	return err;
> >   }


* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-04-11 14:35     ` Rodrigo Vivi
@ 2023-04-12 10:56       ` Andi Shyti
  2023-04-12 13:10         ` Rodrigo Vivi
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Shyti @ 2023-04-12 10:56 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Andi Shyti, intel-gfx, Matthew Auld, dri-devel, Maciej Patelczyk,
	stable, Chris Wilson, Das, Nirmoy

Hi Rodrigo,

> > > Currently, when we perform operations such as clearing or copying
> > > large blocks of memory, we generate multiple requests that are
> > > executed in a chain.
> > > 
> > > However, if one of these requests fails, we may not realize it
> > > unless it happens to be the last request in the chain. This is
> > > because errors are not properly propagated.
> > > 
> > > For this we need to keep propagating the chain of fence
> > > notification in order to always reach the final fence associated
> > > to the final request.
> > > 
> > > To address this issue, we need to ensure that the chain of fence
> > > notifications is always propagated so that we can reach the final
> > > fence associated with the last request. By doing so, we will be
> > > able to detect any memory operation  failures and determine
> > > whether the memory is still invalid.
> > > 
> > > On copy and clear migration signal fences upon completion.
> > > 
> > > On copy and clear migration, signal fences upon request
> > > completion to ensure that we have a reliable perpetuation of the
> > > operation outcome.
> > > 
> > > Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> > > Reported-by: Matthew Auld <matthew.auld@intel.com>
> > > Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > Cc: stable@vger.kernel.org
> > > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> > With  Matt's comment regarding missing lock in intel_context_migrate_clear
> > addressed, this is:
> > 
> > Acked-by: Nirmoy Das <nirmoy.das@intel.com>
> 
> Nack!
> 
> Please get some ack from Joonas or Tvrtko before merging this series.

There is no architectural change... of course, Joonas and Tvrtko
are more than welcome (and actually invited) to look into this
patch.

And, btw, there are still some discussions ongoing on this whole
series, so I'm not going to merge it any time soon. I'm just
happy to revive the discussion.

> It is a big series targeting stable o.O where the revisions in the cover
> letter are not helping me to be confident that this is the right approach
> instead of simply reverting the original offending commit:
> 
> cf586021642d ("drm/i915/gt: Pipelined page migration")

Why should we remove the migration completely? What about the
copy?

> It looks to me that we are adding magic on top of magic to work around
> the deadlocks, but then adding more waits inside locks... And with
> the hang checks vs. heartbeats, is this really an issue in current upstream
> code, or was it only on DII?

There is no real magic happening here. It's just that the error
message was not reaching the end of the operation, whereas this
patch passes it along.

> Where was the bug report to start with?

Matt reported this; I will give you the necessary links to
it offline.

Thanks for looking into this,
Andi


* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-04-12 10:56       ` Andi Shyti
@ 2023-04-12 13:10         ` Rodrigo Vivi
  2023-04-13 11:25           ` Tvrtko Ursulin
  0 siblings, 1 reply; 16+ messages in thread
From: Rodrigo Vivi @ 2023-04-12 13:10 UTC (permalink / raw)
  To: Andi Shyti
  Cc: Maciej Patelczyk, intel-gfx, dri-devel, Matthew Auld, stable,
	Das, Nirmoy, Chris Wilson, Andi Shyti

On Wed, Apr 12, 2023 at 12:56:26PM +0200, Andi Shyti wrote:
> Hi Rodrigo,
> 
> > > > Currently, when we perform operations such as clearing or copying
> > > > large blocks of memory, we generate multiple requests that are
> > > > executed in a chain.
> > > > 
> > > > However, if one of these requests fails, we may not realize it
> > > > unless it happens to be the last request in the chain. This is
> > > > because errors are not properly propagated.
> > > > 
> > > > For this we need to keep propagating the chain of fence
> > > > notification in order to always reach the final fence associated
> > > > to the final request.
> > > > 
> > > > To address this issue, we need to ensure that the chain of fence
> > > > notifications is always propagated so that we can reach the final
> > > > fence associated with the last request. By doing so, we will be
> > > > able to detect any memory operation  failures and determine
> > > > whether the memory is still invalid.
> > > > 
> > > > On copy and clear migration signal fences upon completion.
> > > > 
> > > > On copy and clear migration, signal fences upon request
> > > > completion to ensure that we have a reliable perpetuation of the
> > > > operation outcome.
> > > > 
> > > > Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> > > > Reported-by: Matthew Auld <matthew.auld@intel.com>
> > > > Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > > Cc: stable@vger.kernel.org
> > > > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> > > With  Matt's comment regarding missing lock in intel_context_migrate_clear
> > > addressed, this is:
> > > 
> > > Acked-by: Nirmoy Das <nirmoy.das@intel.com>
> > 
> > Nack!
> > 
> > Please get some ack from Joonas or Tvrtko before merging this series.
> 
> There is no architectural change... of course, Joonas and Tvrtko
> are more than welcome (and actually invited) to look into this
> patch.
> 
> And, btw, there are still some discussions ongoing on this whole
> series, so that I'm not going to merge it any time soon. I'm just
> happy to revive the discussion.
> 
> > It is a big series targeting stable o.O where the revisions in the cover
> > letter are not helping me to be confident that this is the right approach
> > instead of simply reverting the original offending commit:
> > 
> > cf586021642d ("drm/i915/gt: Pipelined page migration")
> 
> Why should we remove all the migration completely? What about the
> copy?

Is there any other alternative that doesn't hurt the Linux stable rules?

I honestly fail to see how this one here is "obviously correct and tested",
and it looks to me that it has more "than 100 lines, with context".

Does this series really "fix only one thing" with 5 patches?

> 
> > It looks to me that we are adding magic on top of magic to workaround
> > the deadlocks, but then adding more waits inside locks... And this with
> > the hang checks vs heartbeats, is this really an issue on current upstream
> > code? or was only on DII?
> 
> There is no real magic happening here. It's just that the error
> message was not reaching the end of the operation while this
> patch is passing it over.
> 
> > Where was the bug report to start with?
> 
> Matt reported this; I will send you the necessary links
> offline.

It would be really good to have a report, to see whether this is a
"real bug that bothers people (not a, “This could be a problem…” type thing)."

All quotes above are from:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

> 
> Thanks for looking into this,
> Andi

* Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
  2023-04-12 13:10         ` Rodrigo Vivi
@ 2023-04-13 11:25           ` Tvrtko Ursulin
  0 siblings, 0 replies; 16+ messages in thread
From: Tvrtko Ursulin @ 2023-04-13 11:25 UTC (permalink / raw)
  To: Rodrigo Vivi, Andi Shyti
  Cc: Maciej Patelczyk, intel-gfx, Matthew Auld, stable, dri-devel,
	Das, Nirmoy, Chris Wilson, Andi Shyti


On 12/04/2023 14:10, Rodrigo Vivi wrote:
> On Wed, Apr 12, 2023 at 12:56:26PM +0200, Andi Shyti wrote:
>> Hi Rodrigo,
>>
>>>>> Currently, when we perform operations such as clearing or copying
>>>>> large blocks of memory, we generate multiple requests that are
>>>>> executed in a chain.
>>>>>
>>>>> However, if one of these requests fails, we may not realize it
>>>>> unless it happens to be the last request in the chain. This is
>>>>> because errors are not properly propagated.
>>>>>
>>>>> For this we need to keep propagating the chain of fence
>>>>> notification in order to always reach the final fence associated
>>>>> to the final request.
>>>>>
>>>>> To address this issue, we need to ensure that the chain of fence
>>>>> notifications is always propagated so that we can reach the final
>>>>> fence associated with the last request. By doing so, we will be
>>>>> able to detect any memory operation  failures and determine
>>>>> whether the memory is still invalid.
>>>>>
>>>>> On copy and clear migration signal fences upon completion.
>>>>>
>>>>> On copy and clear migration, signal fences upon request
>>>>> completion to ensure that we have a reliable perpetuation of the
>>>>> operation outcome.
>>>>>
>>>>> Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
>>>>> Reported-by: Matthew Auld <matthew.auld@intel.com>
>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>> Cc: stable@vger.kernel.org

Try to find out from which kernel version this needs to go in. For
instance, going by cf586021642d80 it would be v5.15+, but that commit
actually has no users apart from selftests. So I think we should find
the first user that can be user facing and mark the appropriate kernel
version in the stable tag.

>>>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>>>> With  Matt's comment regarding missing lock in intel_context_migrate_clear
>>>> addressed, this is:
>>>>
>>>> Acked-by: Nirmoy Das <nirmoy.das@intel.com>
>>>
>>> Nack!
>>>
>>> Please get some ack from Joonas or Tvrtko before merging this series.
>>
>> There is no architectural change... of course, Joonas and Tvrtko
>> are more than welcome (and actually invited) to look into this
>> patch.
>>
>> And, btw, there are still some discussions ongoing on this whole
>> series, so that I'm not going to merge it any time soon. I'm just
>> happy to revive the discussion.
>>
>>> It is a big series targeting stable o.O where the revisions in the cover
>>> letter are not helping me to be confident that this is the right approach
>>> instead of simply reverting the original offending commit:
>>>
>>> cf586021642d ("drm/i915/gt: Pipelined page migration")
>>
>> Why should we remove all the migration completely? What about the
>> copy?
> 
> Is there any other alternative that doesn't hurt the Linux stable rules?
> 
> I honestly fail to see how this one here is "obviously correct and tested",
> and it looks to me that it has more "than 100 lines, with context".
> 
> Does this series really "fix only one thing" with 5 patches?

This is challenging.

The fix looks needed to me at a high level (I haven't read the patch
details yet), but a series sent to stable can go quite badly, and we had
such a problem very recently with only a two-patch series. And this one
is indeed quite large.

Reverting cf586021642d80 definitely isn't an option, because other code
depends on what it added and would need an alternative implementation.
We would lose async clear/migrate, which would be bad, and implementing
the alternative could also be a large patch.

So since I think we are indeed stuck with fixing this - would it be 
better to squash it all into one patch for easier backporting?

We can also look at whether there are ways to make the diff smaller.

Regards,

Tvrtko

>>> It looks to me that we are adding magic on top of magic to workaround
>>> the deadlocks, but then adding more waits inside locks... And this with
>>> the hang checks vs heartbeats, is this really an issue on current upstream
>>> code? or was only on DII?
>>
>> There is no real magic happening here. It's just that the error
>> message was not reaching the end of the operation while this
>> patch is passing it over.
>>
>>> Where was the bug report to start with?
>>
>> Matt reported this; I will send you the necessary links
>> offline.
> 
> It would be really good to have a report, to see whether this is a
> "real bug that bothers people (not a, “This could be a problem…” type thing)."
> 
> All quotes above are from:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> 
>>
>> Thanks for looking into this,
>> Andi

Thread overview: 16+ messages
2023-03-08  9:41 [Intel-gfx] [PATCH v4 0/5] Fix error propagation amongst request Andi Shyti
2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 1/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
2023-04-11  8:58   ` Andrzej Hajda
2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 2/5] drm/i915/gt: Add intel_context_timeline_is_locked helper Andi Shyti
2023-04-11  6:30   ` Das, Nirmoy
2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 3/5] drm/i915: Create the locked version of the request create Andi Shyti
2023-04-11  6:30   ` Das, Nirmoy
2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 4/5] drm/i915: Create the locked version of the request add Andi Shyti
2023-03-08  9:41 ` [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
2023-03-10 10:03   ` Matthew Auld
2023-04-11  6:39   ` Das, Nirmoy
2023-04-11 14:35     ` Rodrigo Vivi
2023-04-12 10:56       ` Andi Shyti
2023-04-12 13:10         ` Rodrigo Vivi
2023-04-13 11:25           ` Tvrtko Ursulin
2023-03-08 11:22 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for Fix error propagation amongst request (rev2) Patchwork
