All of lore.kernel.org
 help / color / mirror / Atom feed
* [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context
@ 2020-09-16  9:42 Chris Wilson
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

We have to be very careful while walking the timeline->requests list
under the RCU guard, as the requests (and so rq->link) use
SLAB_TYPESAFE_BY_RCU and so the requests may be reallocated within an
rcu grace period. As the requests are reallocated, they are removed from
one list and placed on another, and if we are iterating over that
request at that moment, the list iteration jumps from one list to the
next and promptly gets confused. Verify we hold the request reference
to ensure that the request is not added to a new list behind our backs.

<4> [582.745252] general protection fault, probably for non-canonical address 0xcccccccccccccd5c: 0000 [#1] PREEMPT SMP PTI
<4> [582.745297] CPU: 0 PID: 1475 Comm: gem_ctx_persist Not tainted 5.9.0-rc1-CI-CI_DRM_8908+ #1
<4> [582.745304] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [582.745317] RIP: 0010:__lock_acquire+0x2c3/0x1f40
<4> [582.745323] Code: 00 65 8b 05 c7 8a ef 7e 85 c0 0f 85 b4 07 00 00 44 8b 9d c4 08 00 00 45 85 db 0f 84 0f 01 00 00 ba 05 00 00 00 e9 c8 06 00 00 <48> 81 3f c0 89 c7 82 b8 00 00 00 00 41 0f 45 c0 83 fe 01 41 89 c3
<4> [582.745334] RSP: 0018:ffffc9000461bc40 EFLAGS: 00010002
<4> [582.745340] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
<4> [582.745345] RDX: 0000000000000000 RSI: 0000000000000000 RDI: cccccccccccccd5c
<4> [582.745350] RBP: ffff8881ec4a2880 R08: 0000000000000001 R09: 0000000000000001
<4> [582.745356] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
<4> [582.745361] R13: 0000000000000000 R14: 0000000000000000 R15: cccccccccccccd5c
<4> [582.745367] FS:  00007fb44da78e40(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
<4> [582.745373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [582.745378] CR2: 00007fb44daad040 CR3: 0000000268428000 CR4: 0000000000350ef0
<4> [582.745383] Call Trace:
<4> [582.745390]  ? __lock_acquire+0x913/0x1f40
<4> [582.745397]  lock_acquire+0xb5/0x3c0
<4> [582.745526]  ? kill_engines+0x19a/0x4b0 [i915]
<4> [582.745533]  ? find_held_lock+0x2d/0x90
<4> [582.745541]  _raw_spin_lock_irq+0x30/0x40
<4> [582.745635]  ? kill_engines+0x19a/0x4b0 [i915]
<4> [582.745727]  kill_engines+0x19a/0x4b0 [i915]
<4> [582.745820]  context_close+0x195/0x410 [i915]
<4> [582.745912]  i915_gem_context_close+0x5b/0x160 [i915]
<4> [582.745994]  i915_driver_postclose+0x14/0x40 [i915]
<4> [582.746003]  drm_file_free.part.13+0x240/0x290
<4> [582.746009]  drm_release_noglobal+0x16/0x50
<4> [582.746016]  __fput+0xa5/0x250
<4> [582.746021]  task_work_run+0x6e/0xb0
<4> [582.746028]  exit_to_user_mode_prepare+0x178/0x180
<4> [582.746034]  syscall_exit_to_user_mode+0x36/0x220
<4> [582.746040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
<4> [582.746045] RIP: 0033:0x7fb44d1dc421
<4> [582.746050] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 ea cf 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
<4> [582.746062] RSP: 002b:00007ffed2e83818 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
<4> [582.746069] RAX: 0000000000000000 RBX: 0000556410bfe840 RCX: 00007fb44d1dc421
<4> [582.746075] RDX: 000000000000000a RSI: 00000000c0406469 RDI: 0000000000000008
<4> [582.746080] RBP: 0000000000000008 R08: 00007fb44d1c51cc R09: 00007fb44d1c5240
<4> [582.746086] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000fffffffb
<4> [582.746091] R13: 0000000000000006 R14: 0000000000000000 R15: 000000000000000a
<4> [582.746099] Modules linked in: vgem mei_hdcp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio btusb btrtl btbcm btintel x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul bluetooth ghash_clmulni_intel ecdh_generic ecc i915 r8169 realtek mei_me mei snd_hda_intel i2c_hid snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_geminilake pinctrl_intel prime_numbers [last unloaded: test_drm_mm]

Fixes: 09a3054f38db ("drm/i915/gem: Reduce context termination list iteration guard to RCU")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 25 ++++++++++++++++-----
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index cf5ecbde9e06..a548626fa8bc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -460,8 +460,8 @@ __active_engine(struct i915_request *rq, struct intel_engine_cs **active)
 		spin_lock(&locked->active.lock);
 	}
 
-	if (!i915_request_completed(rq)) {
-		if (i915_request_is_active(rq) && rq->fence.error != -EIO)
+	if (i915_request_is_active(rq)) {
+		if (!i915_request_completed(rq))
 			*active = locked;
 		ret = true;
 	}
@@ -479,13 +479,26 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 	if (!ce->timeline)
 		return NULL;
 
+	/*
+	 * rq->link is only SLAB_TYPESAFE_BY_RCU, we need to hold a reference
+	 * to the request to prevent it being transferred to a new timeline
+	 * (and onto a new timeline->requests list).
+	 */
 	rcu_read_lock();
-	list_for_each_entry_rcu(rq, &ce->timeline->requests, link) {
-		if (i915_request_is_active(rq) && i915_request_completed(rq))
-			continue;
+	list_for_each_entry_reverse(rq, &ce->timeline->requests, link) {
+		bool found;
+
+		/* timeline is already completed upto this point? */
+		if (!i915_request_get_rcu(rq))
+			break;
 
 		/* Check with the backend if the request is inflight */
-		if (__active_engine(rq, &engine))
+		found = true;
+		if (likely(rcu_access_pointer(rq->timeline) == ce->timeline))
+			found = __active_engine(rq, &engine);
+
+		i915_request_put(rq);
+		if (found)
 			break;
 	}
 	rcu_read_unlock();
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
@ 2020-09-16  9:42   ` Chris Wilson
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

We only allow persistent requests to remain on the GPU past the closure
of their containing context (and process) so long as they are continuously
checked for hangs or allow other requests to preempt them, as we need to
ensure forward progress of the system. If we allow persistent contexts
to remain on the system after the the hangcheck mechanism is disabled,
the system may grind to a halt. On disabling the mechanism, we sent a
pulse along the engine to remove all executing contexts from the engine
which would check for hung contexts -- but we did not prevent those
contexts from being resubmitted if they survived the final hangcheck.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-stop
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
 drivers/gpu/drm/i915/i915_request.c    | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 08e2c000dcc3..7c3a1012e702 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+static inline bool
+intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
+{
+	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
+		return false;
+
+	return READ_ONCE(engine->props.heartbeat_interval_ms);
+}
+
 #endif /* _INTEL_RINGBUFFER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 436ce368ddaa..0e813819b041 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
 	if (i915_request_completed(request))
 		goto xfer;
 
+	if (unlikely(intel_context_is_closed(request->context) &&
+		     !intel_engine_has_heartbeat(engine)))
+		intel_context_set_banned(request->context);
+
 	if (unlikely(intel_context_is_banned(request->context)))
 		i915_request_set_error_once(request, -EIO);
+
 	if (unlikely(fatal_error(request->fence.error)))
 		__i915_request_skip(request);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
@ 2020-09-16  9:42   ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: stable, Chris Wilson

We only allow persistent requests to remain on the GPU past the closure
of their containing context (and process) so long as they are continuously
checked for hangs or allow other requests to preempt them, as we need to
ensure forward progress of the system. If we allow persistent contexts
to remain on the system after the the hangcheck mechanism is disabled,
the system may grind to a halt. On disabling the mechanism, we sent a
pulse along the engine to remove all executing contexts from the engine
which would check for hung contexts -- but we did not prevent those
contexts from being resubmitted if they survived the final hangcheck.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-stop
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
 drivers/gpu/drm/i915/i915_request.c    | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 08e2c000dcc3..7c3a1012e702 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+static inline bool
+intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
+{
+	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
+		return false;
+
+	return READ_ONCE(engine->props.heartbeat_interval_ms);
+}
+
 #endif /* _INTEL_RINGBUFFER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 436ce368ddaa..0e813819b041 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
 	if (i915_request_completed(request))
 		goto xfer;
 
+	if (unlikely(intel_context_is_closed(request->context) &&
+		     !intel_engine_has_heartbeat(engine)))
+		intel_context_set_banned(request->context);
+
 	if (unlikely(intel_context_is_banned(request->context)))
 		i915_request_set_error_once(request, -EIO);
+
 	if (unlikely(fatal_error(request->fence.error)))
 		__i915_request_skip(request);
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
@ 2020-09-16  9:42   ` Chris Wilson
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

Currently, we check we can send a pulse prior to disabling the
heartbeat to verify that we can change the heartbeat, but since we may
re-evaluate execution upon changing the heartbeat interval we need another
pulse afterwards to refresh execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 8ffdf676c0a0..d09df370f7cd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
 	WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
 
 	if (intel_engine_pm_get_if_awake(engine)) {
-		if (delay)
+		if (delay) {
 			intel_engine_unpark_heartbeat(engine);
-		else
+		} else {
 			intel_engine_park_heartbeat(engine);
+			intel_engine_pulse(engine); /* recheck execution */
+		}
 		intel_engine_pm_put(engine);
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
@ 2020-09-16  9:42   ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: stable, Chris Wilson

Currently, we check we can send a pulse prior to disabling the
heartbeat to verify that we can change the heartbeat, but since we may
re-evaluate execution upon changing the heartbeat interval we need another
pulse afterwards to refresh execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 8ffdf676c0a0..d09df370f7cd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
 	WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
 
 	if (intel_engine_pm_get_if_awake(engine)) {
-		if (delay)
+		if (delay) {
 			intel_engine_unpark_heartbeat(engine);
-		else
+		} else {
 			intel_engine_park_heartbeat(engine);
+			intel_engine_pulse(engine); /* recheck execution */
+		}
 		intel_engine_pm_put(engine);
 	}
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
@ 2020-09-16  9:42   ` Chris Wilson
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Joonas Lahtinen, stable

Verify that if a context is active at the time it is closed, that it is
either persistent and preemptible (with hangcheck running) or it shall
be removed from execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-close
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index a548626fa8bc..4fd38101bb56 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
 	return rcu_dereference_protected(ctx->engines, true);
 }
 
-static bool __reset_engine(struct intel_engine_cs *engine)
-{
-	struct intel_gt *gt = engine->gt;
-	bool success = false;
-
-	if (!intel_has_reset_engine(gt))
-		return false;
-
-	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
-			      &gt->reset.flags)) {
-		success = intel_engine_reset(engine, NULL) == 0;
-		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
-				      &gt->reset.flags);
-	}
-
-	return success;
-}
-
 static void __reset_context(struct i915_gem_context *ctx,
 			    struct intel_engine_cs *engine)
 {
@@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
 	 * kill the banned context, we fallback to doing a local reset
 	 * instead.
 	 */
-	if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
-	    !intel_engine_pulse(engine))
-		return true;
-
-	/* If we are unable to send a pulse, try resetting this engine. */
-	return __reset_engine(engine);
+	return intel_engine_pulse(engine) == 0;
 }
 
 static bool
@@ -506,7 +483,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 	return engine;
 }
 
-static void kill_engines(struct i915_gem_engines *engines)
+static void kill_engines(struct i915_gem_engines *engines, bool ban)
 {
 	struct i915_gem_engines_iter it;
 	struct intel_context *ce;
@@ -521,7 +498,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 	for_each_gem_engine(ce, engines, it) {
 		struct intel_engine_cs *engine;
 
-		if (intel_context_set_banned(ce))
+		if (ban && intel_context_set_banned(ce))
 			continue;
 
 		/*
@@ -534,7 +511,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 		engine = active_engine(ce);
 
 		/* First attempt to gracefully cancel the context */
-		if (engine && !__cancel_engine(engine))
+		if (engine && !__cancel_engine(engine) && ban)
 			/*
 			 * If we are unable to send a preemptive pulse to bump
 			 * the context from the GPU, we have to resort to a full
@@ -544,8 +521,10 @@ static void kill_engines(struct i915_gem_engines *engines)
 	}
 }
 
-static void kill_stale_engines(struct i915_gem_context *ctx)
+static void kill_context(struct i915_gem_context *ctx)
 {
+	bool ban = (!i915_gem_context_is_persistent(ctx) ||
+		    !ctx->i915->params.enable_hangcheck);
 	struct i915_gem_engines *pos, *next;
 
 	spin_lock_irq(&ctx->stale.lock);
@@ -558,7 +537,7 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 
 		spin_unlock_irq(&ctx->stale.lock);
 
-		kill_engines(pos);
+		kill_engines(pos, ban);
 
 		spin_lock_irq(&ctx->stale.lock);
 		GEM_BUG_ON(i915_sw_fence_signaled(&pos->fence));
@@ -570,11 +549,6 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 	spin_unlock_irq(&ctx->stale.lock);
 }
 
-static void kill_context(struct i915_gem_context *ctx)
-{
-	kill_stale_engines(ctx);
-}
-
 static void engines_idle_release(struct i915_gem_context *ctx,
 				 struct i915_gem_engines *engines)
 {
@@ -609,7 +583,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
 
 kill:
 	if (list_empty(&engines->link)) /* raced, already closed */
-		kill_engines(engines);
+		kill_engines(engines, true);
 
 	i915_sw_fence_commit(&engines->fence);
 }
@@ -667,9 +641,7 @@ static void context_close(struct i915_gem_context *ctx)
 	 * case we opt to forcibly kill off all remaining requests on
 	 * context close.
 	 */
-	if (!i915_gem_context_is_persistent(ctx) ||
-	    !ctx->i915->params.enable_hangcheck)
-		kill_context(ctx);
+	kill_context(ctx);
 
 	i915_gem_context_put(ctx);
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
@ 2020-09-16  9:42   ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-16  9:42 UTC (permalink / raw)
  To: intel-gfx; +Cc: stable, Chris Wilson

Verify that if a context is active at the time it is closed, that it is
either persistent and preemptible (with hangcheck running) or it shall
be removed from execution.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-close
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
 1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index a548626fa8bc..4fd38101bb56 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
 	return rcu_dereference_protected(ctx->engines, true);
 }
 
-static bool __reset_engine(struct intel_engine_cs *engine)
-{
-	struct intel_gt *gt = engine->gt;
-	bool success = false;
-
-	if (!intel_has_reset_engine(gt))
-		return false;
-
-	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
-			      &gt->reset.flags)) {
-		success = intel_engine_reset(engine, NULL) == 0;
-		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
-				      &gt->reset.flags);
-	}
-
-	return success;
-}
-
 static void __reset_context(struct i915_gem_context *ctx,
 			    struct intel_engine_cs *engine)
 {
@@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
 	 * kill the banned context, we fallback to doing a local reset
 	 * instead.
 	 */
-	if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
-	    !intel_engine_pulse(engine))
-		return true;
-
-	/* If we are unable to send a pulse, try resetting this engine. */
-	return __reset_engine(engine);
+	return intel_engine_pulse(engine) == 0;
 }
 
 static bool
@@ -506,7 +483,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
 	return engine;
 }
 
-static void kill_engines(struct i915_gem_engines *engines)
+static void kill_engines(struct i915_gem_engines *engines, bool ban)
 {
 	struct i915_gem_engines_iter it;
 	struct intel_context *ce;
@@ -521,7 +498,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 	for_each_gem_engine(ce, engines, it) {
 		struct intel_engine_cs *engine;
 
-		if (intel_context_set_banned(ce))
+		if (ban && intel_context_set_banned(ce))
 			continue;
 
 		/*
@@ -534,7 +511,7 @@ static void kill_engines(struct i915_gem_engines *engines)
 		engine = active_engine(ce);
 
 		/* First attempt to gracefully cancel the context */
-		if (engine && !__cancel_engine(engine))
+		if (engine && !__cancel_engine(engine) && ban)
 			/*
 			 * If we are unable to send a preemptive pulse to bump
 			 * the context from the GPU, we have to resort to a full
@@ -544,8 +521,10 @@ static void kill_engines(struct i915_gem_engines *engines)
 	}
 }
 
-static void kill_stale_engines(struct i915_gem_context *ctx)
+static void kill_context(struct i915_gem_context *ctx)
 {
+	bool ban = (!i915_gem_context_is_persistent(ctx) ||
+		    !ctx->i915->params.enable_hangcheck);
 	struct i915_gem_engines *pos, *next;
 
 	spin_lock_irq(&ctx->stale.lock);
@@ -558,7 +537,7 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 
 		spin_unlock_irq(&ctx->stale.lock);
 
-		kill_engines(pos);
+		kill_engines(pos, ban);
 
 		spin_lock_irq(&ctx->stale.lock);
 		GEM_BUG_ON(i915_sw_fence_signaled(&pos->fence));
@@ -570,11 +549,6 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
 	spin_unlock_irq(&ctx->stale.lock);
 }
 
-static void kill_context(struct i915_gem_context *ctx)
-{
-	kill_stale_engines(ctx);
-}
-
 static void engines_idle_release(struct i915_gem_context *ctx,
 				 struct i915_gem_engines *engines)
 {
@@ -609,7 +583,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
 
 kill:
 	if (list_empty(&engines->link)) /* raced, already closed */
-		kill_engines(engines);
+		kill_engines(engines, true);
 
 	i915_sw_fence_commit(&engines->fence);
 }
@@ -667,9 +641,7 @@ static void context_close(struct i915_gem_context *ctx)
 	 * case we opt to forcibly kill off all remaining requests on
 	 * context close.
 	 */
-	if (!i915_gem_context_is_persistent(ctx) ||
-	    !ctx->i915->params.enable_hangcheck)
-		kill_context(ctx);
+	kill_context(ctx);
 
 	i915_gem_context_put(ctx);
 }
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
                   ` (2 preceding siblings ...)
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
@ 2020-09-16 13:03 ` Patchwork
  2020-09-16 13:04 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2020-09-16 13:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
URL   : https://patchwork.freedesktop.org/series/81728/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
8e90943e9601 drm/i915/gem: Hold request reference for canceling an active context
-:16: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#16: 
<4> [582.745252] general protection fault, probably for non-canonical address 0xcccccccccccccd5c: 0000 [#1] PREEMPT SMP PTI

total: 0 errors, 1 warnings, 0 checks, 40 lines checked
8066b7124e28 drm/i915: Cancel outstanding work after disabling heartbeats on an engine
532d48821928 drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
5d8ace4dbe64 drm/i915/gem: Always test execution status on closing the context


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
                   ` (3 preceding siblings ...)
  2020-09-16 13:03 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context Patchwork
@ 2020-09-16 13:04 ` Patchwork
  2020-09-16 13:28 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2020-09-16 13:04 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
URL   : https://patchwork.freedesktop.org/series/81728/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/i915/gt/intel_reset.c:1311:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block
+drivers/gpu/drm/i915/gvt/mmio.c:290:23: warning: memcpy with byte count of 279040
+drivers/gpu/drm/i915/i915_perf.c:1440:15: warning: memset with byte count of 16777216
+drivers/gpu/drm/i915/i915_perf.c:1494:15: warning: memset with byte count of 16777216
+./include/linux/seqlock.h:752:24: warning: trying to copy expression type 31
+./include/linux/seqlock.h:778:16: warning: trying to copy expression type 31
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen11_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen12_fwtable_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read64' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_read8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write16' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write32' - different lock contexts for basic block
+./include/linux/spinlock.h:409:9: warning: context imbalance in 'gen8_write8' - different lock contexts for basic block


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
                   ` (4 preceding siblings ...)
  2020-09-16 13:04 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2020-09-16 13:28 ` Patchwork
  2020-09-16 17:19 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
  2020-09-24 13:34 ` [Intel-gfx] [PATCH 1/4] " Tvrtko Ursulin
  7 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2020-09-16 13:28 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 6753 bytes --]

== Series Details ==

Series: series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
URL   : https://patchwork.freedesktop.org/series/81728/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_9019 -> Patchwork_18510
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/index.html

Known issues
------------

  Here are the changes found in Patchwork_18510 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_render_tiled_blits@basic:
    - fi-tgl-y:           [PASS][1] -> [DMESG-WARN][2] ([i915#402])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-tgl-y/igt@gem_render_tiled_blits@basic.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-tgl-y/igt@gem_render_tiled_blits@basic.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - fi-bsw-n3050:       [PASS][3] -> [DMESG-WARN][4] ([i915#1982])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-bsw-n3050/igt@i915_pm_rpm@basic-pci-d3-state.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-bsw-n3050/igt@i915_pm_rpm@basic-pci-d3-state.html

  * igt@vgem_basic@unload:
    - fi-kbl-x1275:       [PASS][5] -> [DMESG-WARN][6] ([i915#62] / [i915#92] / [i915#95])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-kbl-x1275/igt@vgem_basic@unload.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-kbl-x1275/igt@vgem_basic@unload.html

  
#### Possible fixes ####

  * {igt@core_hotunplug@unbind-rebind}:
    - fi-kbl-x1275:       [DMESG-WARN][7] ([i915#62] / [i915#92] / [i915#95]) -> [PASS][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-kbl-x1275/igt@core_hotunplug@unbind-rebind.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-kbl-x1275/igt@core_hotunplug@unbind-rebind.html

  * igt@i915_module_load@reload:
    - fi-byt-j1900:       [DMESG-WARN][9] ([i915#1982]) -> [PASS][10] +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-byt-j1900/igt@i915_module_load@reload.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-byt-j1900/igt@i915_module_load@reload.html
    - fi-apl-guc:         [DMESG-WARN][11] ([i915#1635] / [i915#1982]) -> [PASS][12] +1 similar issue
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-apl-guc/igt@i915_module_load@reload.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-apl-guc/igt@i915_module_load@reload.html

  * igt@kms_busy@basic@modeset:
    - fi-tgl-y:           [DMESG-WARN][13] ([i915#1982]) -> [PASS][14] +2 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-tgl-y/igt@kms_busy@basic@modeset.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-tgl-y/igt@kms_busy@basic@modeset.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1:
    - fi-icl-u2:          [DMESG-WARN][15] ([i915#1982]) -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1.html

  * igt@vgem_basic@setversion:
    - fi-tgl-y:           [DMESG-WARN][17] ([i915#402]) -> [PASS][18] +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-tgl-y/igt@vgem_basic@setversion.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-tgl-y/igt@vgem_basic@setversion.html

  
#### Warnings ####

  * igt@gem_exec_suspend@basic-s3:
    - fi-tgl-y:           [DMESG-WARN][19] ([i915#2411]) -> [DMESG-WARN][20] ([i915#2411] / [i915#402])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-tgl-y/igt@gem_exec_suspend@basic-s3.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-tgl-y/igt@gem_exec_suspend@basic-s3.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-c:
    - fi-kbl-x1275:       [DMESG-WARN][21] ([i915#62] / [i915#92] / [i915#95]) -> [DMESG-WARN][22] ([i915#62] / [i915#92]) +4 similar issues
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-kbl-x1275/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-c.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-kbl-x1275/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-c.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
    - fi-kbl-x1275:       [DMESG-WARN][23] ([i915#62] / [i915#92]) -> [DMESG-WARN][24] ([i915#62] / [i915#92] / [i915#95]) +2 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/fi-kbl-x1275/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/fi-kbl-x1275/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#1635]: https://gitlab.freedesktop.org/drm/intel/issues/1635
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2411]: https://gitlab.freedesktop.org/drm/intel/issues/2411
  [i915#289]: https://gitlab.freedesktop.org/drm/intel/issues/289
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#62]: https://gitlab.freedesktop.org/drm/intel/issues/62
  [i915#92]: https://gitlab.freedesktop.org/drm/intel/issues/92
  [i915#95]: https://gitlab.freedesktop.org/drm/intel/issues/95


Participating hosts (45 -> 39)
------------------------------

  Additional (1): fi-skl-6700k2 
  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * Linux: CI_DRM_9019 -> Patchwork_18510

  CI-20190529: 20190529
  CI_DRM_9019: 038c228475ce10a6f9cc4052250a1315f3c7c627 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5786: 222051026b978ebbc0dc58db62d7a1f29728f95f @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_18510: 5d8ace4dbe64e8d24bb034c6801f5efed78b3994 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

5d8ace4dbe64 drm/i915/gem: Always test execution status on closing the context
532d48821928 drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
8066b7124e28 drm/i915: Cancel outstanding work after disabling heartbeats on an engine
8e90943e9601 drm/i915/gem: Hold request reference for canceling an active context

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/index.html

[-- Attachment #1.2: Type: text/html, Size: 8911 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.IGT: failure for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
                   ` (5 preceding siblings ...)
  2020-09-16 13:28 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2020-09-16 17:19 ` Patchwork
  2020-09-24 13:34 ` [Intel-gfx] [PATCH 1/4] " Tvrtko Ursulin
  7 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2020-09-16 17:19 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 15014 bytes --]

== Series Details ==

Series: series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context
URL   : https://patchwork.freedesktop.org/series/81728/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_9019_full -> Patchwork_18510_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_18510_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_18510_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_18510_full:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_exec_schedule@preempt-hang@rcs0:
    - shard-tglb:         [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-tglb1/igt@gem_exec_schedule@preempt-hang@rcs0.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-tglb8/igt@gem_exec_schedule@preempt-hang@rcs0.html

  * igt@runner@aborted:
    - shard-kbl:          NOTRUN -> [FAIL][3]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl1/igt@runner@aborted.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@core_hotunplug@hotrebind-lateclose}:
    - shard-snb:          [FAIL][4] ([i915#2469]) -> [FAIL][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-snb5/igt@core_hotunplug@hotrebind-lateclose.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-snb4/igt@core_hotunplug@hotrebind-lateclose.html
    - shard-iclb:         [FAIL][6] ([i915#2469]) -> [FAIL][7]
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-iclb2/igt@core_hotunplug@hotrebind-lateclose.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-iclb6/igt@core_hotunplug@hotrebind-lateclose.html
    - shard-tglb:         [DMESG-FAIL][8] ([i915#2469]) -> [FAIL][9]
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-tglb5/igt@core_hotunplug@hotrebind-lateclose.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-tglb6/igt@core_hotunplug@hotrebind-lateclose.html
    - shard-glk:          [FAIL][10] ([i915#2469]) -> [FAIL][11]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-glk2/igt@core_hotunplug@hotrebind-lateclose.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-glk3/igt@core_hotunplug@hotrebind-lateclose.html
    - shard-skl:          [FAIL][12] ([i915#2469]) -> [FAIL][13]
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl7/igt@core_hotunplug@hotrebind-lateclose.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl1/igt@core_hotunplug@hotrebind-lateclose.html
    - shard-kbl:          [FAIL][14] ([i915#2469]) -> [FAIL][15]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-kbl1/igt@core_hotunplug@hotrebind-lateclose.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl3/igt@core_hotunplug@hotrebind-lateclose.html

  * {igt@core_hotunplug@unbind-rebind}:
    - shard-kbl:          [PASS][16] -> [INCOMPLETE][17]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-kbl6/igt@core_hotunplug@unbind-rebind.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl1/igt@core_hotunplug@unbind-rebind.html

  
Known issues
------------

  Here are the changes found in Patchwork_18510_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_big_fb@x-tiled-8bpp-rotate-0:
    - shard-apl:          [PASS][18] -> [DMESG-WARN][19] ([i915#1635] / [i915#1982]) +1 similar issue
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-apl2/igt@kms_big_fb@x-tiled-8bpp-rotate-0.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-apl4/igt@kms_big_fb@x-tiled-8bpp-rotate-0.html

  * igt@kms_cursor_crc@pipe-b-cursor-suspend:
    - shard-skl:          [PASS][20] -> [INCOMPLETE][21] ([i915#300])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl9/igt@kms_cursor_crc@pipe-b-cursor-suspend.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl5/igt@kms_cursor_crc@pipe-b-cursor-suspend.html

  * igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible@c-edp1:
    - shard-skl:          [PASS][22] -> [FAIL][23] ([i915#2122])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl4/igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible@c-edp1.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl8/igt@kms_flip@flip-vs-absolute-wf_vblank-interruptible@c-edp1.html

  * igt@kms_flip@flip-vs-suspend-interruptible@a-dp1:
    - shard-kbl:          [PASS][24] -> [DMESG-WARN][25] ([i915#180]) +5 similar issues
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-kbl6/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl4/igt@kms_flip@flip-vs-suspend-interruptible@a-dp1.html

  * igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-mmap-cpu:
    - shard-iclb:         [PASS][26] -> [DMESG-WARN][27] ([i915#1982])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-iclb1/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-mmap-cpu.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-iclb3/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-shrfb-draw-mmap-cpu.html

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-pri-indfb-draw-mmap-wc:
    - shard-tglb:         [PASS][28] -> [DMESG-WARN][29] ([i915#1982]) +3 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-tglb7/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-pri-indfb-draw-mmap-wc.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-tglb8/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-pri-indfb-draw-mmap-wc.html

  * igt@kms_hdr@bpc-switch:
    - shard-skl:          [PASS][30] -> [FAIL][31] ([i915#1188])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl10/igt@kms_hdr@bpc-switch.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl8/igt@kms_hdr@bpc-switch.html

  * igt@kms_plane@plane-panning-top-left-pipe-c-planes:
    - shard-skl:          [PASS][32] -> [DMESG-WARN][33] ([i915#1982]) +15 similar issues
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl6/igt@kms_plane@plane-panning-top-left-pipe-c-planes.html
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl1/igt@kms_plane@plane-panning-top-left-pipe-c-planes.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min:
    - shard-skl:          [PASS][34] -> [FAIL][35] ([fdo#108145] / [i915#265]) +1 similar issue
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl2/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl7/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html

  * igt@kms_psr@psr2_sprite_mmap_gtt:
    - shard-iclb:         [PASS][36] -> [SKIP][37] ([fdo#109441])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-iclb2/igt@kms_psr@psr2_sprite_mmap_gtt.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-iclb6/igt@kms_psr@psr2_sprite_mmap_gtt.html

  * igt@kms_setmode@basic:
    - shard-apl:          [PASS][38] -> [FAIL][39] ([i915#1635] / [i915#31])
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-apl7/igt@kms_setmode@basic.html
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-apl7/igt@kms_setmode@basic.html

  * igt@kms_vblank@pipe-b-ts-continuation-suspend:
    - shard-skl:          [PASS][40] -> [INCOMPLETE][41] ([i915#198])
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl6/igt@kms_vblank@pipe-b-ts-continuation-suspend.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl9/igt@kms_vblank@pipe-b-ts-continuation-suspend.html

  
#### Possible fixes ####

  * igt@gem_exec_reloc@basic-many-active@bcs0:
    - shard-glk:          [FAIL][42] ([i915#2389]) -> [PASS][43]
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-glk6/igt@gem_exec_reloc@basic-many-active@bcs0.html
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-glk9/igt@gem_exec_reloc@basic-many-active@bcs0.html

  * igt@gem_exec_whisper@basic-fds-priority-all:
    - shard-glk:          [DMESG-WARN][44] ([i915#118] / [i915#95]) -> [PASS][45]
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-glk2/igt@gem_exec_whisper@basic-fds-priority-all.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-glk6/igt@gem_exec_whisper@basic-fds-priority-all.html

  * igt@gen9_exec_parse@cmd-crossing-page:
    - shard-skl:          [DMESG-WARN][46] ([i915#1982]) -> [PASS][47] +3 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl2/igt@gen9_exec_parse@cmd-crossing-page.html
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl7/igt@gen9_exec_parse@cmd-crossing-page.html

  * igt@i915_pm_dc@dc6-psr:
    - shard-iclb:         [FAIL][48] ([i915#454]) -> [PASS][49]
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-iclb4/igt@i915_pm_dc@dc6-psr.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-iclb4/igt@i915_pm_dc@dc6-psr.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic:
    - shard-skl:          [FAIL][50] ([i915#2346]) -> [PASS][51]
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl7/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl6/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html

  * igt@kms_flip@flip-vs-suspend@a-vga1:
    - shard-snb:          [DMESG-WARN][52] ([i915#42]) -> [PASS][53]
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-snb7/igt@kms_flip@flip-vs-suspend@a-vga1.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-snb4/igt@kms_flip@flip-vs-suspend@a-vga1.html

  * igt@kms_flip@plain-flip-ts-check-interruptible@a-dp1:
    - shard-kbl:          [DMESG-WARN][54] ([i915#1982]) -> [PASS][55]
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-kbl4/igt@kms_flip@plain-flip-ts-check-interruptible@a-dp1.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl1/igt@kms_flip@plain-flip-ts-check-interruptible@a-dp1.html

  * igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-indfb-draw-mmap-gtt:
    - shard-glk:          [FAIL][56] ([i915#49]) -> [PASS][57]
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-glk4/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-indfb-draw-mmap-gtt.html
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-glk4/igt@kms_frontbuffer_tracking@fbc-1p-offscren-pri-indfb-draw-mmap-gtt.html

  * igt@kms_frontbuffer_tracking@psr-suspend:
    - shard-skl:          [INCOMPLETE][58] ([i915#123]) -> [PASS][59]
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl5/igt@kms_frontbuffer_tracking@psr-suspend.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl8/igt@kms_frontbuffer_tracking@psr-suspend.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [FAIL][60] ([i915#1188]) -> [PASS][61]
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-skl7/igt@kms_hdr@bpc-switch-dpms.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-skl1/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_hdr@bpc-switch-suspend:
    - shard-kbl:          [DMESG-WARN][62] ([i915#180]) -> [PASS][63]
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-kbl4/igt@kms_hdr@bpc-switch-suspend.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-kbl7/igt@kms_hdr@bpc-switch-suspend.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [SKIP][64] ([fdo#109441]) -> [PASS][65] +1 similar issue
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9019/shard-iclb3/igt@kms_psr@psr2_cursor_render.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/shard-iclb2/igt@kms_psr@psr2_cursor_render.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [i915#118]: https://gitlab.freedesktop.org/drm/intel/issues/118
  [i915#1188]: https://gitlab.freedesktop.org/drm/intel/issues/1188
  [i915#123]: https://gitlab.freedesktop.org/drm/intel/issues/123
  [i915#1635]: https://gitlab.freedesktop.org/drm/intel/issues/1635
  [i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
  [i915#198]: https://gitlab.freedesktop.org/drm/intel/issues/198
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
  [i915#2389]: https://gitlab.freedesktop.org/drm/intel/issues/2389
  [i915#2469]: https://gitlab.freedesktop.org/drm/intel/issues/2469
  [i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
  [i915#300]: https://gitlab.freedesktop.org/drm/intel/issues/300
  [i915#31]: https://gitlab.freedesktop.org/drm/intel/issues/31
  [i915#42]: https://gitlab.freedesktop.org/drm/intel/issues/42
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#49]: https://gitlab.freedesktop.org/drm/intel/issues/49
  [i915#95]: https://gitlab.freedesktop.org/drm/intel/issues/95


Participating hosts (11 -> 11)
------------------------------

  No changes in participating hosts


Build changes
-------------

  * Linux: CI_DRM_9019 -> Patchwork_18510

  CI-20190529: 20190529
  CI_DRM_9019: 038c228475ce10a6f9cc4052250a1315f3c7c627 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5786: 222051026b978ebbc0dc58db62d7a1f29728f95f @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_18510: 5d8ace4dbe64e8d24bb034c6801f5efed78b3994 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18510/index.html

[-- Attachment #1.2: Type: text/html, Size: 17361 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context
  2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
                   ` (6 preceding siblings ...)
  2020-09-16 17:19 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
@ 2020-09-24 13:34 ` Tvrtko Ursulin
  7 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 13:34 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 16/09/2020 10:42, Chris Wilson wrote:
> We have to be very careful while walking the timeline->requests list
> under the RCU guard, as the requests (and so rq->link) use
> SLAB_TYPESAFE_BY_RCU and so the requests may be reallocated within an
> rcu grace period. As the requests are reallocated, they are removed from
> one list and placed on another, and if we are iterating over that
> request at that moment, the list iteration jumps from one list to the
> next and promptly gets confused. Verify we hold the request reference
> to ensure that the request is not added to a new list behind our backs.
> 
> <4> [582.745252] general protection fault, probably for non-canonical address 0xcccccccccccccd5c: 0000 [#1] PREEMPT SMP PTI
> <4> [582.745297] CPU: 0 PID: 1475 Comm: gem_ctx_persist Not tainted 5.9.0-rc1-CI-CI_DRM_8908+ #1
> <4> [582.745304] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
> <4> [582.745317] RIP: 0010:__lock_acquire+0x2c3/0x1f40
> <4> [582.745323] Code: 00 65 8b 05 c7 8a ef 7e 85 c0 0f 85 b4 07 00 00 44 8b 9d c4 08 00 00 45 85 db 0f 84 0f 01 00 00 ba 05 00 00 00 e9 c8 06 00 00 <48> 81 3f c0 89 c7 82 b8 00 00 00 00 41 0f 45 c0 83 fe 01 41 89 c3
> <4> [582.745334] RSP: 0018:ffffc9000461bc40 EFLAGS: 00010002
> <4> [582.745340] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> <4> [582.745345] RDX: 0000000000000000 RSI: 0000000000000000 RDI: cccccccccccccd5c
> <4> [582.745350] RBP: ffff8881ec4a2880 R08: 0000000000000001 R09: 0000000000000001
> <4> [582.745356] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> <4> [582.745361] R13: 0000000000000000 R14: 0000000000000000 R15: cccccccccccccd5c
> <4> [582.745367] FS:  00007fb44da78e40(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
> <4> [582.745373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [582.745378] CR2: 00007fb44daad040 CR3: 0000000268428000 CR4: 0000000000350ef0
> <4> [582.745383] Call Trace:
> <4> [582.745390]  ? __lock_acquire+0x913/0x1f40
> <4> [582.745397]  lock_acquire+0xb5/0x3c0
> <4> [582.745526]  ? kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745533]  ? find_held_lock+0x2d/0x90
> <4> [582.745541]  _raw_spin_lock_irq+0x30/0x40
> <4> [582.745635]  ? kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745727]  kill_engines+0x19a/0x4b0 [i915]
> <4> [582.745820]  context_close+0x195/0x410 [i915]
> <4> [582.745912]  i915_gem_context_close+0x5b/0x160 [i915]
> <4> [582.745994]  i915_driver_postclose+0x14/0x40 [i915]
> <4> [582.746003]  drm_file_free.part.13+0x240/0x290
> <4> [582.746009]  drm_release_noglobal+0x16/0x50
> <4> [582.746016]  __fput+0xa5/0x250
> <4> [582.746021]  task_work_run+0x6e/0xb0
> <4> [582.746028]  exit_to_user_mode_prepare+0x178/0x180
> <4> [582.746034]  syscall_exit_to_user_mode+0x36/0x220
> <4> [582.746040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> <4> [582.746045] RIP: 0033:0x7fb44d1dc421
> <4> [582.746050] Code: f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 8b 05 ea cf 20 00 85 c0 75 16 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10
> <4> [582.746062] RSP: 002b:00007ffed2e83818 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> <4> [582.746069] RAX: 0000000000000000 RBX: 0000556410bfe840 RCX: 00007fb44d1dc421
> <4> [582.746075] RDX: 000000000000000a RSI: 00000000c0406469 RDI: 0000000000000008
> <4> [582.746080] RBP: 0000000000000008 R08: 00007fb44d1c51cc R09: 00007fb44d1c5240
> <4> [582.746086] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000fffffffb
> <4> [582.746091] R13: 0000000000000006 R14: 0000000000000000 R15: 000000000000000a
> <4> [582.746099] Modules linked in: vgem mei_hdcp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio btusb btrtl btbcm btintel x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul bluetooth ghash_clmulni_intel ecdh_generic ecc i915 r8169 realtek mei_me mei snd_hda_intel i2c_hid snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm pinctrl_geminilake pinctrl_intel prime_numbers [last unloaded: test_drm_mm]
> 
> Fixes: 09a3054f38db ("drm/i915/gem: Reduce context termination list iteration guard to RCU")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 25 ++++++++++++++++-----
>   1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index cf5ecbde9e06..a548626fa8bc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -460,8 +460,8 @@ __active_engine(struct i915_request *rq, struct intel_engine_cs **active)
>   		spin_lock(&locked->active.lock);
>   	}
>   
> -	if (!i915_request_completed(rq)) {
> -		if (i915_request_is_active(rq) && rq->fence.error != -EIO)
> +	if (i915_request_is_active(rq)) {
> +		if (!i915_request_completed(rq))
>   			*active = locked;
>   		ret = true;
>   	}
> @@ -479,13 +479,26 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
>   	if (!ce->timeline)
>   		return NULL;
>   
> +	/*
> +	 * rq->link is only SLAB_TYPESAFE_BY_RCU, we need to hold a reference
> +	 * to the request to prevent it being transferred to a new timeline
> +	 * (and onto a new timeline->requests list).
> +	 */
>   	rcu_read_lock();
> -	list_for_each_entry_rcu(rq, &ce->timeline->requests, link) {
> -		if (i915_request_is_active(rq) && i915_request_completed(rq))
> -			continue;
> +	list_for_each_entry_reverse(rq, &ce->timeline->requests, link) {
> +		bool found;
> +
> +		/* timeline is already completed upto this point? */
> +		if (!i915_request_get_rcu(rq))
> +			break;
>   
>   		/* Check with the backend if the request is inflight */
> -		if (__active_engine(rq, &engine))
> +		found = true;
> +		if (likely(rcu_access_pointer(rq->timeline) == ce->timeline))

So the timeline check is equivalent to request_completed in a way which 
I think holds and now I see comment above documents it.

> +			found = __active_engine(rq, &engine);
> +
> +		i915_request_put(rq);
> +		if (found)
>   			break;
>   	}
>   	rcu_read_unlock();
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
@ 2020-09-24 13:35     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 13:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> We only allow persistent requests to remain on the GPU past the closure
> of their containing context (and process) so long as they are continuously
> checked for hangs or allow other requests to preempt them, as we need to
> ensure forward progress of the system. If we allow persistent contexts
> to remain on the system after the the hangcheck mechanism is disabled,
> the system may grind to a halt. On disabling the mechanism, we sent a
> pulse along the engine to remove all executing contexts from the engine
> which would check for hung contexts -- but we did not prevent those
> contexts from being resubmitted if they survived the final hangcheck.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-stop
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
>   drivers/gpu/drm/i915/i915_request.c    | 5 +++++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 08e2c000dcc3..7c3a1012e702 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>   	return intel_engine_has_preemption(engine);
>   }
>   
> +static inline bool
> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> +{
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> +		return false;
> +
> +	return READ_ONCE(engine->props.heartbeat_interval_ms);
> +}
> +
>   #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 436ce368ddaa..0e813819b041 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
>   	if (i915_request_completed(request))
>   		goto xfer;
>   
> +	if (unlikely(intel_context_is_closed(request->context) &&
> +		     !intel_engine_has_heartbeat(engine)))
> +		intel_context_set_banned(request->context);
> +
>   	if (unlikely(intel_context_is_banned(request->context)))
>   		i915_request_set_error_once(request, -EIO);
> +
>   	if (unlikely(fatal_error(request->fence.error)))
>   		__i915_request_skip(request);
>   
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
@ 2020-09-24 13:35     ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 13:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> We only allow persistent requests to remain on the GPU past the closure
> of their containing context (and process) so long as they are continuously
> checked for hangs or allow other requests to preempt them, as we need to
> ensure forward progress of the system. If we allow persistent contexts
> to remain on the system after the the hangcheck mechanism is disabled,
> the system may grind to a halt. On disabling the mechanism, we sent a
> pulse along the engine to remove all executing contexts from the engine
> which would check for hung contexts -- but we did not prevent those
> contexts from being resubmitted if they survived the final hangcheck.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-stop
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
>   drivers/gpu/drm/i915/i915_request.c    | 5 +++++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 08e2c000dcc3..7c3a1012e702 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>   	return intel_engine_has_preemption(engine);
>   }
>   
> +static inline bool
> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> +{
> +	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> +		return false;
> +
> +	return READ_ONCE(engine->props.heartbeat_interval_ms);
> +}
> +
>   #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 436ce368ddaa..0e813819b041 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
>   	if (i915_request_completed(request))
>   		goto xfer;
>   
> +	if (unlikely(intel_context_is_closed(request->context) &&
> +		     !intel_engine_has_heartbeat(engine)))
> +		intel_context_set_banned(request->context);
> +
>   	if (unlikely(intel_context_is_banned(request->context)))
>   		i915_request_set_error_once(request, -EIO);
> +
>   	if (unlikely(fatal_error(request->fence.error)))
>   		__i915_request_skip(request);
>   
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
@ 2020-09-24 13:43     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 13:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> Currently, we check we can send a pulse prior to disabling the
> heartbeat to verify that we can change the heartbeat, but since we may
> re-evaluate execution upon changing the heartbeat interval we need another
> pulse afterwards to refresh execution.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 8ffdf676c0a0..d09df370f7cd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
>   	WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
>   
>   	if (intel_engine_pm_get_if_awake(engine)) {
> -		if (delay)
> +		if (delay) {
>   			intel_engine_unpark_heartbeat(engine);
> -		else
> +		} else {
>   			intel_engine_park_heartbeat(engine);
> +			intel_engine_pulse(engine); /* recheck execution */
> +		}
>   		intel_engine_pm_put(engine);
>   	}
>   
> 

I did not immediately get this one. Do we really need two pulses or 
maybe we could re-order the code a bit and just undo the heartbeat park 
if pulse after parking did not work?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
@ 2020-09-24 13:43     ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 13:43 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> Currently, we check we can send a pulse prior to disabling the
> heartbeat to verify that we can change the heartbeat, but since we may
> re-evaluate execution upon changing the heartbeat interval we need another
> pulse afterwards to refresh execution.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 8ffdf676c0a0..d09df370f7cd 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
>   	WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
>   
>   	if (intel_engine_pm_get_if_awake(engine)) {
> -		if (delay)
> +		if (delay) {
>   			intel_engine_unpark_heartbeat(engine);
> -		else
> +		} else {
>   			intel_engine_park_heartbeat(engine);
> +			intel_engine_pulse(engine); /* recheck execution */
> +		}
>   		intel_engine_pm_put(engine);
>   	}
>   
> 

I did not immediately get this one. Do we really need two pulses or 
maybe we could re-order the code a bit and just undo the heartbeat park 
if pulse after parking did not work?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
@ 2020-09-24 14:26     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 14:26 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> Verify that if a context is active at the time it is closed, that it is
> either persistent and preemptible (with hangcheck running) or it shall
> be removed from execution.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-close
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
>   1 file changed, 10 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index a548626fa8bc..4fd38101bb56 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
>   	return rcu_dereference_protected(ctx->engines, true);
>   }
>   
> -static bool __reset_engine(struct intel_engine_cs *engine)
> -{
> -	struct intel_gt *gt = engine->gt;
> -	bool success = false;
> -
> -	if (!intel_has_reset_engine(gt))
> -		return false;
> -
> -	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> -			      &gt->reset.flags)) {
> -		success = intel_engine_reset(engine, NULL) == 0;
> -		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
> -				      &gt->reset.flags);
> -	}
> -
> -	return success;
> -}
> -
>   static void __reset_context(struct i915_gem_context *ctx,
>   			    struct intel_engine_cs *engine)
>   {
> @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
>   	 * kill the banned context, we fallback to doing a local reset
>   	 * instead.
>   	 */
> -	if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
> -	    !intel_engine_pulse(engine))
> -		return true;
> -
> -	/* If we are unable to send a pulse, try resetting this engine. */
> -	return __reset_engine(engine);
> +	return intel_engine_pulse(engine) == 0;

Where is the path now which actually resets the currently running 
workload (engine) of a non-persistent context? Pulse will be sent and 
then rely on hangcheck but otherwise let it run?

Regards,

Tvrtko

>   }
>   
>   static bool
> @@ -506,7 +483,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
>   	return engine;
>   }
>   
> -static void kill_engines(struct i915_gem_engines *engines)
> +static void kill_engines(struct i915_gem_engines *engines, bool ban)
>   {
>   	struct i915_gem_engines_iter it;
>   	struct intel_context *ce;
> @@ -521,7 +498,7 @@ static void kill_engines(struct i915_gem_engines *engines)
>   	for_each_gem_engine(ce, engines, it) {
>   		struct intel_engine_cs *engine;
>   
> -		if (intel_context_set_banned(ce))
> +		if (ban && intel_context_set_banned(ce))
>   			continue;
>   
>   		/*
> @@ -534,7 +511,7 @@ static void kill_engines(struct i915_gem_engines *engines)
>   		engine = active_engine(ce);
>   
>   		/* First attempt to gracefully cancel the context */
> -		if (engine && !__cancel_engine(engine))
> +		if (engine && !__cancel_engine(engine) && ban)
>   			/*
>   			 * If we are unable to send a preemptive pulse to bump
>   			 * the context from the GPU, we have to resort to a full
> @@ -544,8 +521,10 @@ static void kill_engines(struct i915_gem_engines *engines)
>   	}
>   }
>   
> -static void kill_stale_engines(struct i915_gem_context *ctx)
> +static void kill_context(struct i915_gem_context *ctx)
>   {
> +	bool ban = (!i915_gem_context_is_persistent(ctx) ||
> +		    !ctx->i915->params.enable_hangcheck);
>   	struct i915_gem_engines *pos, *next;
>   
>   	spin_lock_irq(&ctx->stale.lock);
> @@ -558,7 +537,7 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
>   
>   		spin_unlock_irq(&ctx->stale.lock);
>   
> -		kill_engines(pos);
> +		kill_engines(pos, ban);
>   
>   		spin_lock_irq(&ctx->stale.lock);
>   		GEM_BUG_ON(i915_sw_fence_signaled(&pos->fence));
> @@ -570,11 +549,6 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
>   	spin_unlock_irq(&ctx->stale.lock);
>   }
>   
> -static void kill_context(struct i915_gem_context *ctx)
> -{
> -	kill_stale_engines(ctx);
> -}
> -
>   static void engines_idle_release(struct i915_gem_context *ctx,
>   				 struct i915_gem_engines *engines)
>   {
> @@ -609,7 +583,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
>   
>   kill:
>   	if (list_empty(&engines->link)) /* raced, already closed */
> -		kill_engines(engines);
> +		kill_engines(engines, true);
>   
>   	i915_sw_fence_commit(&engines->fence);
>   }
> @@ -667,9 +641,7 @@ static void context_close(struct i915_gem_context *ctx)
>   	 * case we opt to forcibly kill off all remaining requests on
>   	 * context close.
>   	 */
> -	if (!i915_gem_context_is_persistent(ctx) ||
> -	    !ctx->i915->params.enable_hangcheck)
> -		kill_context(ctx);
> +	kill_context(ctx);
>   
>   	i915_gem_context_put(ctx);
>   }
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
@ 2020-09-24 14:26     ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-24 14:26 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 16/09/2020 10:42, Chris Wilson wrote:
> Verify that if a context is active at the time it is closed, that it is
> either persistent and preemptible (with hangcheck running) or it shall
> be removed from execution.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-close
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
>   1 file changed, 10 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index a548626fa8bc..4fd38101bb56 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
>   	return rcu_dereference_protected(ctx->engines, true);
>   }
>   
> -static bool __reset_engine(struct intel_engine_cs *engine)
> -{
> -	struct intel_gt *gt = engine->gt;
> -	bool success = false;
> -
> -	if (!intel_has_reset_engine(gt))
> -		return false;
> -
> -	if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> -			      &gt->reset.flags)) {
> -		success = intel_engine_reset(engine, NULL) == 0;
> -		clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
> -				      &gt->reset.flags);
> -	}
> -
> -	return success;
> -}
> -
>   static void __reset_context(struct i915_gem_context *ctx,
>   			    struct intel_engine_cs *engine)
>   {
> @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
>   	 * kill the banned context, we fallback to doing a local reset
>   	 * instead.
>   	 */
> -	if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
> -	    !intel_engine_pulse(engine))
> -		return true;
> -
> -	/* If we are unable to send a pulse, try resetting this engine. */
> -	return __reset_engine(engine);
> +	return intel_engine_pulse(engine) == 0;

Where is the path now which actually resets the currently running 
workload (engine) of a non-persistent context? Pulse will be sent and 
then rely on hangcheck but otherwise let it run?

Regards,

Tvrtko

>   }
>   
>   static bool
> @@ -506,7 +483,7 @@ static struct intel_engine_cs *active_engine(struct intel_context *ce)
>   	return engine;
>   }
>   
> -static void kill_engines(struct i915_gem_engines *engines)
> +static void kill_engines(struct i915_gem_engines *engines, bool ban)
>   {
>   	struct i915_gem_engines_iter it;
>   	struct intel_context *ce;
> @@ -521,7 +498,7 @@ static void kill_engines(struct i915_gem_engines *engines)
>   	for_each_gem_engine(ce, engines, it) {
>   		struct intel_engine_cs *engine;
>   
> -		if (intel_context_set_banned(ce))
> +		if (ban && intel_context_set_banned(ce))
>   			continue;
>   
>   		/*
> @@ -534,7 +511,7 @@ static void kill_engines(struct i915_gem_engines *engines)
>   		engine = active_engine(ce);
>   
>   		/* First attempt to gracefully cancel the context */
> -		if (engine && !__cancel_engine(engine))
> +		if (engine && !__cancel_engine(engine) && ban)
>   			/*
>   			 * If we are unable to send a preemptive pulse to bump
>   			 * the context from the GPU, we have to resort to a full
> @@ -544,8 +521,10 @@ static void kill_engines(struct i915_gem_engines *engines)
>   	}
>   }
>   
> -static void kill_stale_engines(struct i915_gem_context *ctx)
> +static void kill_context(struct i915_gem_context *ctx)
>   {
> +	bool ban = (!i915_gem_context_is_persistent(ctx) ||
> +		    !ctx->i915->params.enable_hangcheck);
>   	struct i915_gem_engines *pos, *next;
>   
>   	spin_lock_irq(&ctx->stale.lock);
> @@ -558,7 +537,7 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
>   
>   		spin_unlock_irq(&ctx->stale.lock);
>   
> -		kill_engines(pos);
> +		kill_engines(pos, ban);
>   
>   		spin_lock_irq(&ctx->stale.lock);
>   		GEM_BUG_ON(i915_sw_fence_signaled(&pos->fence));
> @@ -570,11 +549,6 @@ static void kill_stale_engines(struct i915_gem_context *ctx)
>   	spin_unlock_irq(&ctx->stale.lock);
>   }
>   
> -static void kill_context(struct i915_gem_context *ctx)
> -{
> -	kill_stale_engines(ctx);
> -}
> -
>   static void engines_idle_release(struct i915_gem_context *ctx,
>   				 struct i915_gem_engines *engines)
>   {
> @@ -609,7 +583,7 @@ static void engines_idle_release(struct i915_gem_context *ctx,
>   
>   kill:
>   	if (list_empty(&engines->link)) /* raced, already closed */
> -		kill_engines(engines);
> +		kill_engines(engines, true);
>   
>   	i915_sw_fence_commit(&engines->fence);
>   }
> @@ -667,9 +641,7 @@ static void context_close(struct i915_gem_context *ctx)
>   	 * case we opt to forcibly kill off all remaining requests on
>   	 * context close.
>   	 */
> -	if (!i915_gem_context_is_persistent(ctx) ||
> -	    !ctx->i915->params.enable_hangcheck)
> -		kill_context(ctx);
> +	kill_context(ctx);
>   
>   	i915_gem_context_put(ctx);
>   }
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-09-24 13:43     ` Tvrtko Ursulin
@ 2020-09-25 10:01       ` Chris Wilson
  -1 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 10:01 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
> 
> On 16/09/2020 10:42, Chris Wilson wrote:
> > Currently, we check we can send a pulse prior to disabling the
> > heartbeat to verify that we can change the heartbeat, but since we may
> > re-evaluate execution upon changing the heartbeat interval we need another
> > pulse afterwards to refresh execution.
> > 
> > Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: <stable@vger.kernel.org> # v5.7+
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > index 8ffdf676c0a0..d09df370f7cd 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
> >       WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
> >   
> >       if (intel_engine_pm_get_if_awake(engine)) {
> > -             if (delay)
> > +             if (delay) {
> >                       intel_engine_unpark_heartbeat(engine);
> > -             else
> > +             } else {
> >                       intel_engine_park_heartbeat(engine);
> > +                     intel_engine_pulse(engine); /* recheck execution */
> > +             }
> >               intel_engine_pm_put(engine);
> >       }
> >   
> > 
> 
> I did not immediately get this one. Do we really need two pulses or 
> maybe we could re-order the code a bit and just undo the heartbeat park 
> if pulse after parking did not work?

We use the first pulse to determine if it's legal for the parameter to
be changed (checking we support the preemptive pulse to remove
non-persistent contexts). Then the second pulse after changing the
parameter to flush the changes through.

I like checking for support before making the change, although we could
try and fixup after failure, there would still be a window where the
change would be visible to the system. We don't need to use the pulse per
se for that check, that's pure convenience as it performs the checking
already.
-Chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
@ 2020-09-25 10:01       ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 10:01 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
> 
> On 16/09/2020 10:42, Chris Wilson wrote:
> > Currently, we check we can send a pulse prior to disabling the
> > heartbeat to verify that we can change the heartbeat, but since we may
> > re-evaluate execution upon changing the heartbeat interval we need another
> > pulse afterwards to refresh execution.
> > 
> > Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: <stable@vger.kernel.org> # v5.7+
> > ---
> >   drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > index 8ffdf676c0a0..d09df370f7cd 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
> >       WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
> >   
> >       if (intel_engine_pm_get_if_awake(engine)) {
> > -             if (delay)
> > +             if (delay) {
> >                       intel_engine_unpark_heartbeat(engine);
> > -             else
> > +             } else {
> >                       intel_engine_park_heartbeat(engine);
> > +                     intel_engine_pulse(engine); /* recheck execution */
> > +             }
> >               intel_engine_pm_put(engine);
> >       }
> >   
> > 
> 
> I did not immediately get this one. Do we really need two pulses or 
> maybe we could re-order the code a bit and just undo the heartbeat park 
> if pulse after parking did not work?

We use the first pulse to determine if it's legal for the parameter to
be changed (checking we support the preemptive pulse to remove
non-persistent contexts). Then the second pulse after changing the
parameter to flush the changes through.

I like checking for support before making the change, although we could
try and fixup after failure, there would still be a window where the
change would be visible to the system. We don't need to use the pulse per
se for that check, that's pure convenience as it performs the checking
already.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
  2020-09-24 14:26     ` Tvrtko Ursulin
@ 2020-09-25 10:05       ` Chris Wilson
  -1 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 10:05 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-24 15:26:56)
> 
> On 16/09/2020 10:42, Chris Wilson wrote:
> > Verify that if a context is active at the time it is closed, that it is
> > either persistent and preemptible (with hangcheck running) or it shall
> > be removed from execution.
> > 
> > Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> > Testcase: igt/gem_ctx_persistence/heartbeat-close
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: <stable@vger.kernel.org> # v5.7+
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
> >   1 file changed, 10 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index a548626fa8bc..4fd38101bb56 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
> >       return rcu_dereference_protected(ctx->engines, true);
> >   }
> >   
> > -static bool __reset_engine(struct intel_engine_cs *engine)
> > -{
> > -     struct intel_gt *gt = engine->gt;
> > -     bool success = false;
> > -
> > -     if (!intel_has_reset_engine(gt))
> > -             return false;
> > -
> > -     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > -                           &gt->reset.flags)) {
> > -             success = intel_engine_reset(engine, NULL) == 0;
> > -             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
> > -                                   &gt->reset.flags);
> > -     }
> > -
> > -     return success;
> > -}
> > -
> >   static void __reset_context(struct i915_gem_context *ctx,
> >                           struct intel_engine_cs *engine)
> >   {
> > @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
> >        * kill the banned context, we fallback to doing a local reset
> >        * instead.
> >        */
> > -     if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
> > -         !intel_engine_pulse(engine))
> > -             return true;
> > -
> > -     /* If we are unable to send a pulse, try resetting this engine. */
> > -     return __reset_engine(engine);
> > +     return intel_engine_pulse(engine) == 0;
> 
> Where is the path now which actually resets the currently running 
> workload (engine) of a non-persistent context? Pulse will be sent and 
> then rely on hangcheck but otherwise let it run?

If the pulse fails, we just call intel_handle_error() on the engine. On
looking at this code again, I could not justify the open-coding of the
engine reset fragment of the general error handler, especially as we end
up taking that route anyway for device resets. (Unlike inside the
tasklet, there's no atomicity concerns on this engine-reset path.)
-Chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
@ 2020-09-25 10:05       ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 10:05 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-24 15:26:56)
> 
> On 16/09/2020 10:42, Chris Wilson wrote:
> > Verify that if a context is active at the time it is closed, that it is
> > either persistent and preemptible (with hangcheck running) or it shall
> > be removed from execution.
> > 
> > Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> > Testcase: igt/gem_ctx_persistence/heartbeat-close
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: <stable@vger.kernel.org> # v5.7+
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
> >   1 file changed, 10 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index a548626fa8bc..4fd38101bb56 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
> >       return rcu_dereference_protected(ctx->engines, true);
> >   }
> >   
> > -static bool __reset_engine(struct intel_engine_cs *engine)
> > -{
> > -     struct intel_gt *gt = engine->gt;
> > -     bool success = false;
> > -
> > -     if (!intel_has_reset_engine(gt))
> > -             return false;
> > -
> > -     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
> > -                           &gt->reset.flags)) {
> > -             success = intel_engine_reset(engine, NULL) == 0;
> > -             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
> > -                                   &gt->reset.flags);
> > -     }
> > -
> > -     return success;
> > -}
> > -
> >   static void __reset_context(struct i915_gem_context *ctx,
> >                           struct intel_engine_cs *engine)
> >   {
> > @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
> >        * kill the banned context, we fallback to doing a local reset
> >        * instead.
> >        */
> > -     if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
> > -         !intel_engine_pulse(engine))
> > -             return true;
> > -
> > -     /* If we are unable to send a pulse, try resetting this engine. */
> > -     return __reset_engine(engine);
> > +     return intel_engine_pulse(engine) == 0;
> 
> Where is the path now which actually resets the currently running 
> workload (engine) of a non-persistent context? Pulse will be sent and 
> then rely on hangcheck but otherwise let it run?

If the pulse fails, we just call intel_handle_error() on the engine. On
looking at this code again, I could not justify the open-coding of the
engine reset fragment of the general error handler, especially as we end
up taking that route anyway for device resets. (Unlike inside the
tasklet, there's no atomicity concerns on this engine-reset path.)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
  2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
@ 2020-09-25 11:04     ` Joonas Lahtinen
  -1 siblings, 0 replies; 30+ messages in thread
From: Joonas Lahtinen @ 2020-09-25 11:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Chris Wilson, stable

Quoting Chris Wilson (2020-09-16 12:42:17)
> We only allow persistent requests to remain on the GPU past the closure
> of their containing context (and process) so long as they are continuously
> checked for hangs or allow other requests to preempt them, as we need to
> ensure forward progress of the system. If we allow persistent contexts
> to remain on the system after the the hangcheck mechanism is disabled,
> the system may grind to a halt. On disabling the mechanism, we sent a
> pulse along the engine to remove all executing contexts from the engine
> which would check for hung contexts -- but we did not prevent those
> contexts from being resubmitted if they survived the final hangcheck.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-stop
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+

Definitely makes sense to ensure.

Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas

> ---
>  drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
>  drivers/gpu/drm/i915/i915_request.c    | 5 +++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 08e2c000dcc3..7c3a1012e702 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>         return intel_engine_has_preemption(engine);
>  }
>  
> +static inline bool
> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> +{
> +       if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> +               return false;
> +
> +       return READ_ONCE(engine->props.heartbeat_interval_ms);
> +}
> +
>  #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 436ce368ddaa..0e813819b041 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
>         if (i915_request_completed(request))
>                 goto xfer;
>  
> +       if (unlikely(intel_context_is_closed(request->context) &&
> +                    !intel_engine_has_heartbeat(engine)))
> +               intel_context_set_banned(request->context);
> +
>         if (unlikely(intel_context_is_banned(request->context)))
>                 i915_request_set_error_once(request, -EIO);
> +
>         if (unlikely(fatal_error(request->fence.error)))
>                 __i915_request_skip(request);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine
@ 2020-09-25 11:04     ` Joonas Lahtinen
  0 siblings, 0 replies; 30+ messages in thread
From: Joonas Lahtinen @ 2020-09-25 11:04 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable, Chris Wilson

Quoting Chris Wilson (2020-09-16 12:42:17)
> We only allow persistent requests to remain on the GPU past the closure
> of their containing context (and process) so long as they are continuously
> checked for hangs or allow other requests to preempt them, as we need to
> ensure forward progress of the system. If we allow persistent contexts
> to remain on the system after the the hangcheck mechanism is disabled,
> the system may grind to a halt. On disabling the mechanism, we sent a
> pulse along the engine to remove all executing contexts from the engine
> which would check for hung contexts -- but we did not prevent those
> contexts from being resubmitted if they survived the final hangcheck.
> 
> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> Testcase: igt/gem_ctx_persistence/heartbeat-stop
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v5.7+

Definitely makes sense to ensure.

Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Regards, Joonas

> ---
>  drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
>  drivers/gpu/drm/i915/i915_request.c    | 5 +++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 08e2c000dcc3..7c3a1012e702 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
>         return intel_engine_has_preemption(engine);
>  }
>  
> +static inline bool
> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
> +{
> +       if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
> +               return false;
> +
> +       return READ_ONCE(engine->props.heartbeat_interval_ms);
> +}
> +
>  #endif /* _INTEL_RINGBUFFER_H_ */
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 436ce368ddaa..0e813819b041 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)
>         if (i915_request_completed(request))
>                 goto xfer;
>  
> +       if (unlikely(intel_context_is_closed(request->context) &&
> +                    !intel_engine_has_heartbeat(engine)))
> +               intel_context_set_banned(request->context);
> +
>         if (unlikely(intel_context_is_banned(request->context)))
>                 i915_request_set_error_once(request, -EIO);
> +
>         if (unlikely(fatal_error(request->fence.error)))
>                 __i915_request_skip(request);
>  
> -- 
> 2.20.1
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-09-25 10:01       ` Chris Wilson
@ 2020-09-25 13:19         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-25 13:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 25/09/2020 11:01, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
>>
>> On 16/09/2020 10:42, Chris Wilson wrote:
>>> Currently, we check we can send a pulse prior to disabling the
>>> heartbeat to verify that we can change the heartbeat, but since we may
>>> re-evaluate execution upon changing the heartbeat interval we need another
>>> pulse afterwards to refresh execution.
>>>
>>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: <stable@vger.kernel.org> # v5.7+
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> index 8ffdf676c0a0..d09df370f7cd 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
>>>        WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
>>>    
>>>        if (intel_engine_pm_get_if_awake(engine)) {
>>> -             if (delay)
>>> +             if (delay) {
>>>                        intel_engine_unpark_heartbeat(engine);
>>> -             else
>>> +             } else {
>>>                        intel_engine_park_heartbeat(engine);
>>> +                     intel_engine_pulse(engine); /* recheck execution */
>>> +             }
>>>                intel_engine_pm_put(engine);
>>>        }
>>>    
>>>
>>
>> I did not immediately get this one. Do we really need two pulses or
>> maybe we could re-order the code a bit and just undo the heartbeat park
>> if pulse after parking did not work?
> 
> We use the first pulse to determine if it's legal for the parameter to
> be changed (checking we support the preemptive pulse to remove
> non-persistent contexts). Then the second pulse after changing the
> parameter to flush the changes through.
> 
> I like checking for support before making the change, although we could
> try and fixup after failure, there would still be a window where the
> change would be visible to the system. We don't need to use the pulse per
> se for that check, that's pure convenience as it performs the checking
> already.

Hm second pulse also has a problem that sneaky user can nerf it with a 
precisely timed SIGINT on itself. It's a bit ridiculous isn't it? :)

Have engine preemption check open coded first and uninterruptible 
flavour of pulse sending? It's also not good since we do want it to be 
interruptible.. Unwind the change and report error back to write(2) if 
intel_engine_pulse failed for any reason?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
@ 2020-09-25 13:19         ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-25 13:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 25/09/2020 11:01, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
>>
>> On 16/09/2020 10:42, Chris Wilson wrote:
>>> Currently, we check we can send a pulse prior to disabling the
>>> heartbeat to verify that we can change the heartbeat, but since we may
>>> re-evaluate execution upon changing the heartbeat interval we need another
>>> pulse afterwards to refresh execution.
>>>
>>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: <stable@vger.kernel.org> # v5.7+
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> index 8ffdf676c0a0..d09df370f7cd 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>>> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
>>>        WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
>>>    
>>>        if (intel_engine_pm_get_if_awake(engine)) {
>>> -             if (delay)
>>> +             if (delay) {
>>>                        intel_engine_unpark_heartbeat(engine);
>>> -             else
>>> +             } else {
>>>                        intel_engine_park_heartbeat(engine);
>>> +                     intel_engine_pulse(engine); /* recheck execution */
>>> +             }
>>>                intel_engine_pm_put(engine);
>>>        }
>>>    
>>>
>>
>> I did not immediately get this one. Do we really need two pulses or
>> maybe we could re-order the code a bit and just undo the heartbeat park
>> if pulse after parking did not work?
> 
> We use the first pulse to determine if it's legal for the parameter to
> be changed (checking we support the preemptive pulse to remove
> non-persistent contexts). Then the second pulse after changing the
> parameter to flush the changes through.
> 
> I like checking for support before making the change, although we could
> try and fixup after failure, there would still be a window where the
> change would be visible to the system. We don't need to use the pulse per
> se for that check, that's pure convenience as it performs the checking
> already.

Hm second pulse also has a problem that sneaky user can nerf it with a 
precisely timed SIGINT on itself. It's a bit ridiculous isn't it? :)

Have engine preemption check open coded first and uninterruptible 
flavour of pulse sending? It's also not good since we do want it to be 
interruptible.. Unwind the change and report error back to write(2) if 
intel_engine_pulse failed for any reason?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
  2020-09-25 10:05       ` Chris Wilson
@ 2020-09-25 13:23         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-25 13:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 25/09/2020 11:05, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-09-24 15:26:56)
>>
>> On 16/09/2020 10:42, Chris Wilson wrote:
>>> Verify that if a context is active at the time it is closed, that it is
>>> either persistent and preemptible (with hangcheck running) or it shall
>>> be removed from execution.
>>>
>>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
>>> Testcase: igt/gem_ctx_persistence/heartbeat-close
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: <stable@vger.kernel.org> # v5.7+
>>> ---
>>>    drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
>>>    1 file changed, 10 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index a548626fa8bc..4fd38101bb56 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
>>>        return rcu_dereference_protected(ctx->engines, true);
>>>    }
>>>    
>>> -static bool __reset_engine(struct intel_engine_cs *engine)
>>> -{
>>> -     struct intel_gt *gt = engine->gt;
>>> -     bool success = false;
>>> -
>>> -     if (!intel_has_reset_engine(gt))
>>> -             return false;
>>> -
>>> -     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
>>> -                           &gt->reset.flags)) {
>>> -             success = intel_engine_reset(engine, NULL) == 0;
>>> -             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
>>> -                                   &gt->reset.flags);
>>> -     }
>>> -
>>> -     return success;
>>> -}
>>> -
>>>    static void __reset_context(struct i915_gem_context *ctx,
>>>                            struct intel_engine_cs *engine)
>>>    {
>>> @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
>>>         * kill the banned context, we fallback to doing a local reset
>>>         * instead.
>>>         */
>>> -     if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
>>> -         !intel_engine_pulse(engine))
>>> -             return true;
>>> -
>>> -     /* If we are unable to send a pulse, try resetting this engine. */
>>> -     return __reset_engine(engine);
>>> +     return intel_engine_pulse(engine) == 0;
>>
>> Where is the path now which actually resets the currently running
>> workload (engine) of a non-persistent context? Pulse will be sent and
>> then rely on hangcheck but otherwise let it run?
> 
> If the pulse fails, we just call intel_handle_error() on the engine. On
> looking at this code again, I could not justify the open-coding of the
> engine reset fragment of the general error handler, especially as we end
> up taking that route anyway for device resets. (Unlike inside the
> tasklet, there's no atomicity concerns on this engine-reset path.)

I think yesterday I got lost in boolean logic and flows here. Today it 
looks fine. Bool ban will be true and code will indeed enter the 
__kill_context path.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context
@ 2020-09-25 13:23         ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2020-09-25 13:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: stable


On 25/09/2020 11:05, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-09-24 15:26:56)
>>
>> On 16/09/2020 10:42, Chris Wilson wrote:
>>> Verify that if a context is active at the time it is closed, that it is
>>> either persistent and preemptible (with hangcheck running) or it shall
>>> be removed from execution.
>>>
>>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
>>> Testcase: igt/gem_ctx_persistence/heartbeat-close
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: <stable@vger.kernel.org> # v5.7+
>>> ---
>>>    drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
>>>    1 file changed, 10 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index a548626fa8bc..4fd38101bb56 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
>>>        return rcu_dereference_protected(ctx->engines, true);
>>>    }
>>>    
>>> -static bool __reset_engine(struct intel_engine_cs *engine)
>>> -{
>>> -     struct intel_gt *gt = engine->gt;
>>> -     bool success = false;
>>> -
>>> -     if (!intel_has_reset_engine(gt))
>>> -             return false;
>>> -
>>> -     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
>>> -                           &gt->reset.flags)) {
>>> -             success = intel_engine_reset(engine, NULL) == 0;
>>> -             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
>>> -                                   &gt->reset.flags);
>>> -     }
>>> -
>>> -     return success;
>>> -}
>>> -
>>>    static void __reset_context(struct i915_gem_context *ctx,
>>>                            struct intel_engine_cs *engine)
>>>    {
>>> @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
>>>         * kill the banned context, we fallback to doing a local reset
>>>         * instead.
>>>         */
>>> -     if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
>>> -         !intel_engine_pulse(engine))
>>> -             return true;
>>> -
>>> -     /* If we are unable to send a pulse, try resetting this engine. */
>>> -     return __reset_engine(engine);
>>> +     return intel_engine_pulse(engine) == 0;
>>
>> Where is the path now which actually resets the currently running
>> workload (engine) of a non-persistent context? Pulse will be sent and
>> then rely on hangcheck but otherwise let it run?
> 
> If the pulse fails, we just call intel_handle_error() on the engine. On
> looking at this code again, I could not justify the open-coding of the
> engine reset fragment of the general error handler, especially as we end
> up taking that route anyway for device resets. (Unlike inside the
> tasklet, there's no atomicity concerns on this engine-reset path.)

I think yesterday I got lost in boolean logic and flows here. Today it 
looks fine. Bool ban will be true and code will indeed enter the 
__kill_context path.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko



_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
  2020-09-25 13:19         ` Tvrtko Ursulin
@ 2020-09-25 14:13           ` Chris Wilson
  -1 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 14:13 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-25 14:19:22)
> 
> On 25/09/2020 11:01, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
> >>
> >> On 16/09/2020 10:42, Chris Wilson wrote:
> >>> Currently, we check we can send a pulse prior to disabling the
> >>> heartbeat to verify that we can change the heartbeat, but since we may
> >>> re-evaluate execution upon changing the heartbeat interval we need another
> >>> pulse afterwards to refresh execution.
> >>>
> >>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>> Cc: <stable@vger.kernel.org> # v5.7+
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
> >>>    1 file changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> index 8ffdf676c0a0..d09df370f7cd 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
> >>>        WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
> >>>    
> >>>        if (intel_engine_pm_get_if_awake(engine)) {
> >>> -             if (delay)
> >>> +             if (delay) {
> >>>                        intel_engine_unpark_heartbeat(engine);
> >>> -             else
> >>> +             } else {
> >>>                        intel_engine_park_heartbeat(engine);
> >>> +                     intel_engine_pulse(engine); /* recheck execution */
> >>> +             }
> >>>                intel_engine_pm_put(engine);
> >>>        }
> >>>    
> >>>
> >>
> >> I did not immediately get this one. Do we really need two pulses or
> >> maybe we could re-order the code a bit and just undo the heartbeat park
> >> if pulse after parking did not work?
> > 
> > We use the first pulse to determine if it's legal for the parameter to
> > be changed (checking we support the preemptive pulse to remove
> > non-persistent contexts). Then the second pulse after changing the
> > parameter to flush the changes through.
> > 
> > I like checking for support before making the change, although we could
> > try and fixup after failure, there would still be a window where the
> > change would be visible to the system. We don't need to use the pulse per
> > se for that check, that's pure convenience as it performs the checking
> > already.
> 
> Hm second pulse also has a problem that sneaky user can nerf it with a 
> precisely timed SIGINT on itself. It's a bit ridiculous isn't it? :)

Sufficient pause for concern. In both this and the kill_context path, we
are using the pulse to evict hostile contexts, that could be reasonable
grounds to have the pulse not be interrupted.
 
> Have engine preemption check open coded first and uninterruptible 
> flavour of pulse sending? It's also not good since we do want it to be 
> interruptible..

Right, allowing it to be interruptible does ensure we have an escape
plan. That escape plan is often invaluable. :|

> Unwind the change and report error back to write(2) if 
> intel_engine_pulse failed for any reason?

The concern then is how atomic do we want the change to appear to be.
Would it be sufficient to only rollback after error if the value still
matches (multiple users), cmpxchg for paranoia. Then anything that
observes the intermediate value will just have to obey the new rules
temporarily, as they would have to in future if the change succeeded.

	new = strtoull_from_user();
	/* validate new */

	old = xchg(&heartbeat, new);
	/* apply changes */
	if (err)
		cmpxchg(&heartbeat, new, old);

If heartbeat is changed in the interval, the pulse either uses the new
or the new-new value, fail/pass regardless. That could mean we report an
error for another heartbeat value from a second user. That's arguing for
a mutex guard for the sysfs writes.
-Chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat
@ 2020-09-25 14:13           ` Chris Wilson
  0 siblings, 0 replies; 30+ messages in thread
From: Chris Wilson @ 2020-09-25 14:13 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: stable

Quoting Tvrtko Ursulin (2020-09-25 14:19:22)
> 
> On 25/09/2020 11:01, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-09-24 14:43:08)
> >>
> >> On 16/09/2020 10:42, Chris Wilson wrote:
> >>> Currently, we check we can send a pulse prior to disabling the
> >>> heartbeat to verify that we can change the heartbeat, but since we may
> >>> re-evaluate execution upon changing the heartbeat interval we need another
> >>> pulse afterwards to refresh execution.
> >>>
> >>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>> Cc: <stable@vger.kernel.org> # v5.7+
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 6 ++++--
> >>>    1 file changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> index 8ffdf676c0a0..d09df370f7cd 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> >>> @@ -192,10 +192,12 @@ int intel_engine_set_heartbeat(struct intel_engine_cs *engine,
> >>>        WRITE_ONCE(engine->props.heartbeat_interval_ms, delay);
> >>>    
> >>>        if (intel_engine_pm_get_if_awake(engine)) {
> >>> -             if (delay)
> >>> +             if (delay) {
> >>>                        intel_engine_unpark_heartbeat(engine);
> >>> -             else
> >>> +             } else {
> >>>                        intel_engine_park_heartbeat(engine);
> >>> +                     intel_engine_pulse(engine); /* recheck execution */
> >>> +             }
> >>>                intel_engine_pm_put(engine);
> >>>        }
> >>>    
> >>>
> >>
> >> I did not immediately get this one. Do we really need two pulses or
> >> maybe we could re-order the code a bit and just undo the heartbeat park
> >> if pulse after parking did not work?
> > 
> > We use the first pulse to determine if it's legal for the parameter to
> > be changed (checking we support the preemptive pulse to remove
> > non-persistent contexts). Then the second pulse after changing the
> > parameter to flush the changes through.
> > 
> > I like checking for support before making the change, although we could
> > try and fixup after failure, there would still be a window where the
> > change would be visible to the system. We don't need to use the pulse per
> > se for that check, that's pure convenience as it performs the checking
> > already.
> 
> Hm second pulse also has a problem that sneaky user can nerf it with a 
> precisely timed SIGINT on itself. It's a bit ridiculous isn't it? :)

Sufficient pause for concern. In both this and the kill_context path, we
are using the pulse to evict hostile contexts, that could be reasonable
grounds to have the pulse not be interrupted.
 
> Have engine preemption check open coded first and uninterruptible 
> flavour of pulse sending? It's also not good since we do want it to be 
> interruptible..

Right, allowing it to be interruptible does ensure we have an escape
plan. That escape plan is often invaluable. :|

> Unwind the change and report error back to write(2) if 
> intel_engine_pulse failed for any reason?

The concern then is how atomic do we want the change to appear to be.
Would it be sufficient to only rollback after error if the value still
matches (multiple users), cmpxchg for paranoia. Then anything that
observes the intermediate value will just have to obey the new rules
temporarily, as they would have to in future if the change succeeded.

	new = strtoull_from_user();
	/* validate new */

	old = xchg(&heartbeat, new);
	/* apply changes */
	if (err)
		cmpxchg(&heartbeat, new, old);

If heartbeat is changed in the interval, the pulse either uses the new
or the new-new value, fail/pass regardless. That could mean we report an
error for another heartbeat value from a second user. That's arguing for
a mutex guard for the sysfs writes.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-09-25 17:08 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-16  9:42 [Intel-gfx] [PATCH 1/4] drm/i915/gem: Hold request reference for canceling an active context Chris Wilson
2020-09-16  9:42 ` [PATCH 2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine Chris Wilson
2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
2020-09-24 13:35   ` Tvrtko Ursulin
2020-09-24 13:35     ` Tvrtko Ursulin
2020-09-25 11:04   ` Joonas Lahtinen
2020-09-25 11:04     ` [Intel-gfx] " Joonas Lahtinen
2020-09-16  9:42 ` [PATCH 3/4] drm/i915/gt: Always send a pulse down the engine after disabling heartbeat Chris Wilson
2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
2020-09-24 13:43   ` Tvrtko Ursulin
2020-09-24 13:43     ` Tvrtko Ursulin
2020-09-25 10:01     ` Chris Wilson
2020-09-25 10:01       ` Chris Wilson
2020-09-25 13:19       ` Tvrtko Ursulin
2020-09-25 13:19         ` Tvrtko Ursulin
2020-09-25 14:13         ` Chris Wilson
2020-09-25 14:13           ` Chris Wilson
2020-09-16  9:42 ` [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context Chris Wilson
2020-09-16  9:42   ` [Intel-gfx] " Chris Wilson
2020-09-24 14:26   ` Tvrtko Ursulin
2020-09-24 14:26     ` Tvrtko Ursulin
2020-09-25 10:05     ` Chris Wilson
2020-09-25 10:05       ` Chris Wilson
2020-09-25 13:23       ` Tvrtko Ursulin
2020-09-25 13:23         ` Tvrtko Ursulin
2020-09-16 13:03 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/4] drm/i915/gem: Hold request reference for canceling an active context Patchwork
2020-09-16 13:04 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-09-16 13:28 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2020-09-16 17:19 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2020-09-24 13:34 ` [Intel-gfx] [PATCH 1/4] " Tvrtko Ursulin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.