All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling
@ 2019-02-18 14:50 Chris Wilson
  2019-02-18 14:50 ` [PATCH 2/2] drm/i915: Use time based guilty context banning Chris Wilson
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Chris Wilson @ 2019-02-18 14:50 UTC (permalink / raw)
  To: intel-gfx

igt_ctx_sseu was caught using bannable contexts and in the course of
resetting rapidly to run its test was banned. Don't let ourselves ban
the test!

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/selftests/i915_gem_context.c  | 1 +
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c   | 2 ++
 drivers/gpu/drm/i915/selftests/intel_workarounds.c | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/selftests/i915_gem_context.c
index 7eb58a9d1319..b7b97c57ad05 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_context.c
@@ -967,6 +967,7 @@ __igt_ctx_sseu(struct drm_i915_private *i915,
 		ret = PTR_ERR(ctx);
 		goto out_unlock;
 	}
+	i915_gem_context_clear_bannable(ctx); /* to reset and beyond! */
 
 	obj = i915_gem_object_create_internal(i915, PAGE_SIZE);
 	if (IS_ERR(obj)) {
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 74e743b101d9..c32bc31192ae 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -56,6 +56,8 @@ static int hang_init(struct hang *h, struct drm_i915_private *i915)
 	if (IS_ERR(h->ctx))
 		return PTR_ERR(h->ctx);
 
+	GEM_BUG_ON(i915_gem_context_is_bannable(h->ctx));
+
 	h->hws = i915_gem_object_create_internal(i915, PAGE_SIZE);
 	if (IS_ERR(h->hws)) {
 		err = PTR_ERR(h->hws);
diff --git a/drivers/gpu/drm/i915/selftests/intel_workarounds.c b/drivers/gpu/drm/i915/selftests/intel_workarounds.c
index d6bb2005024d..fb479a2c04fb 100644
--- a/drivers/gpu/drm/i915/selftests/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/selftests/intel_workarounds.c
@@ -236,6 +236,8 @@ switch_to_scratch_context(struct intel_engine_cs *engine,
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
+	GEM_BUG_ON(i915_gem_context_is_bannable(ctx));
+
 	rq = ERR_PTR(-ENODEV);
 	with_intel_runtime_pm(engine->i915, wakeref)
 		rq = igt_spinner_create_request(spin, ctx, engine, MI_NOOP);
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] drm/i915: Use time based guilty context banning
  2019-02-18 14:50 [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Chris Wilson
@ 2019-02-18 14:50 ` Chris Wilson
  2019-02-19 11:22   ` Mika Kuoppala
  2019-02-18 15:18 ` [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Mika Kuoppala
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2019-02-18 14:50 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Currently, we accumulate each time a context hangs the GPU, offset
against the number of requests it submits, and if that score exceeds a
certain threshold, we ban that context from submitting any more requests
(cancelling any work in flight). In contrast, we use a simple timer on
the file, that if we see more than a 9 hangs faster than 60s apart in
total across all of its contexts, we will ban the client from creating
any more contexts. This leads to a confusing situation where the file
may be banned before the context, so lets use a simple timer scheme for
each.

If the context submits 3 hanging requests within a 120s period, declare
it forbidden to ever send more requests.

This has the advantage of not being easy to repair by simply sending
empty requests, but has the disadvantage that if the context is idle
then it is forgiven. However, if the context is idle, it is not
disrupting the system, but a hog can evade the request counting and
cause much more severe disruption to the system.

Updating ban_score from request retirement is dubious as the retirement
is purposely not in sync with request submission (i.e. we try and batch
retirement to reduce overhead and avoid latency on submission), which
leads to surprising situations where we can forgive a hang immediately
due to a backlog of requests from before the hang being retired
afterwards.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c |  4 ++++
 drivers/gpu/drm/i915/i915_gem_context.h |  9 +++----
 drivers/gpu/drm/i915/i915_gpu_error.c   | 31 +++++++------------------
 drivers/gpu/drm/i915/i915_gpu_error.h   |  3 ---
 drivers/gpu/drm/i915/i915_request.c     |  2 --
 drivers/gpu/drm/i915/i915_reset.c       | 27 ++++++++++++---------
 6 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index da21c843fed8..7541c6f961b3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -355,6 +355,7 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 	struct i915_gem_context *ctx;
 	unsigned int n;
 	int ret;
+	int i;
 
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (ctx == NULL)
@@ -407,6 +408,9 @@ __create_hw_context(struct drm_i915_private *dev_priv,
 	ctx->desc_template =
 		default_desc_template(dev_priv, dev_priv->mm.aliasing_ppgtt);
 
+	for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
+		ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
+
 	return ctx;
 
 err_pid:
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 071108d34ae0..dc6c58f38cfa 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -209,10 +209,11 @@ struct i915_gem_context {
 	 */
 	atomic_t active_count;
 
-#define CONTEXT_SCORE_GUILTY		10
-#define CONTEXT_SCORE_BAN_THRESHOLD	40
-	/** ban_score: Accumulated score of all hangs caused by this context. */
-	atomic_t ban_score;
+	/**
+	 * @hang_timestamp: The last time(s) this context caused a GPU hang
+	 */
+	unsigned long hang_timestamp[2];
+#define CONTEXT_FAST_HANG_JIFFIES (120 * HZ) /* 3 hangs within 120s? Banned! */
 
 	/** remap_slice: Bitmask of cache lines that need remapping */
 	u8 remap_slice;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9a65341fec09..3f6eddb6f6de 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -434,11 +434,6 @@ static void error_print_instdone(struct drm_i915_error_state_buf *m,
 			   ee->instdone.row[slice][subslice]);
 }
 
-static const char *bannable(const struct drm_i915_error_context *ctx)
-{
-	return ctx->bannable ? "" : " (unbannable)";
-}
-
 static void error_print_request(struct drm_i915_error_state_buf *m,
 				const char *prefix,
 				const struct drm_i915_error_request *erq,
@@ -447,9 +442,8 @@ static void error_print_request(struct drm_i915_error_state_buf *m,
 	if (!erq->seqno)
 		return;
 
-	err_printf(m, "%s pid %d, ban score %d, seqno %8x:%08x%s%s, prio %d, emitted %dms, start %08x, head %08x, tail %08x\n",
-		   prefix, erq->pid, erq->ban_score,
-		   erq->context, erq->seqno,
+	err_printf(m, "%s pid %d, seqno %8x:%08x%s%s, prio %d, emitted %dms, start %08x, head %08x, tail %08x\n",
+		   prefix, erq->pid, erq->context, erq->seqno,
 		   test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
 			    &erq->flags) ? "!" : "",
 		   test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
@@ -463,10 +457,9 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, guilty %d active %d\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
-		   ctx->guilty, ctx->active);
+		   ctx->sched_attr.priority, ctx->guilty, ctx->active);
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -688,12 +681,10 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m,
 		if (!error->engine[i].context.pid)
 			continue;
 
-		err_printf(m, "Active process (on ring %s): %s [%d], score %d%s\n",
+		err_printf(m, "Active process (on ring %s): %s [%d]\n",
 			   engine_name(m->i915, i),
 			   error->engine[i].context.comm,
-			   error->engine[i].context.pid,
-			   error->engine[i].context.ban_score,
-			   bannable(&error->engine[i].context));
+			   error->engine[i].context.pid);
 	}
 	err_printf(m, "Reset count: %u\n", error->reset_count);
 	err_printf(m, "Suspend count: %u\n", error->suspend_count);
@@ -779,13 +770,11 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m,
 		if (obj) {
 			err_puts(m, m->i915->engine[i]->name);
 			if (ee->context.pid)
-				err_printf(m, " (submitted by %s [%d], ctx %d [%d], score %d%s)",
+				err_printf(m, " (submitted by %s [%d], ctx %d [%d])",
 					   ee->context.comm,
 					   ee->context.pid,
 					   ee->context.handle,
-					   ee->context.hw_id,
-					   ee->context.ban_score,
-					   bannable(&ee->context));
+					   ee->context.hw_id);
 			err_printf(m, " --- gtt_offset = 0x%08x %08x\n",
 				   upper_32_bits(obj->gtt_offset),
 				   lower_32_bits(obj->gtt_offset));
@@ -1301,8 +1290,6 @@ static void record_request(struct i915_request *request,
 	erq->flags = request->fence.flags;
 	erq->context = ctx->hw_id;
 	erq->sched_attr = request->sched.attr;
-	erq->ban_score = atomic_read(&ctx->ban_score);
-	erq->seqno = request->global_seqno;
 	erq->jiffies = request->emitted_jiffies;
 	erq->start = i915_ggtt_offset(request->ring->vma);
 	erq->head = request->head;
@@ -1396,8 +1383,6 @@ static void record_context(struct drm_i915_error_context *e,
 	e->handle = ctx->user_handle;
 	e->hw_id = ctx->hw_id;
 	e->sched_attr = ctx->sched;
-	e->ban_score = atomic_read(&ctx->ban_score);
-	e->bannable = i915_gem_context_is_bannable(ctx);
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 }
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index afa3adb28f02..94eaf8ab9051 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -122,10 +122,8 @@ struct i915_gpu_state {
 			pid_t pid;
 			u32 handle;
 			u32 hw_id;
-			int ban_score;
 			int active;
 			int guilty;
-			bool bannable;
 			struct i915_sched_attr sched_attr;
 		} context;
 
@@ -149,7 +147,6 @@ struct i915_gpu_state {
 			long jiffies;
 			pid_t pid;
 			u32 context;
-			int ban_score;
 			u32 seqno;
 			u32 start;
 			u32 head;
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5ab4e1c01618..a61e3a4fc9dc 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -290,8 +290,6 @@ static void i915_request_retire(struct i915_request *request)
 
 	i915_request_remove_from_client(request);
 
-	/* Retirement decays the ban score as it is a sign of ctx progress */
-	atomic_dec_if_positive(&request->gem_context->ban_score);
 	intel_context_unpin(request->hw_context);
 
 	__retire_engine_upto(request->engine, request);
diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 1911e00d2581..bae88a4ea924 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -59,24 +59,29 @@ static void client_mark_guilty(struct drm_i915_file_private *file_priv,
 
 static bool context_mark_guilty(struct i915_gem_context *ctx)
 {
-	unsigned int score;
-	bool banned, bannable;
+	unsigned long prev_hang;
+	bool banned;
+	int i;
 
 	atomic_inc(&ctx->guilty_count);
 
-	bannable = i915_gem_context_is_bannable(ctx);
-	score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
-	banned = (!i915_gem_context_is_recoverable(ctx) ||
-		  score >= CONTEXT_SCORE_BAN_THRESHOLD);
-
 	/* Cool contexts don't accumulate client ban score */
-	if (!bannable)
+	if (!i915_gem_context_is_bannable(ctx))
 		return false;
 
+	/* Record the timestamp for the last N hangs */
+	prev_hang = ctx->hang_timestamp[0];
+	for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp) - 1; i++)
+		ctx->hang_timestamp[i] = ctx->hang_timestamp[i + 1];
+	ctx->hang_timestamp[i] = jiffies;
+
+	/* If we have hung N+1 times in rapid succession, we ban the context! */
+	banned = !i915_gem_context_is_recoverable(ctx);
+	if (time_before(jiffies, prev_hang + CONTEXT_FAST_HANG_JIFFIES))
+		banned = true;
 	if (banned) {
-		DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, banned\n",
-				 ctx->name, atomic_read(&ctx->guilty_count),
-				 score);
+		DRM_DEBUG_DRIVER("context %s: guilty %d, banned\n",
+				 ctx->name, atomic_read(&ctx->guilty_count));
 		i915_gem_context_set_banned(ctx);
 	}
 
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling
  2019-02-18 14:50 [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Chris Wilson
  2019-02-18 14:50 ` [PATCH 2/2] drm/i915: Use time based guilty context banning Chris Wilson
@ 2019-02-18 15:18 ` Mika Kuoppala
  2019-02-18 15:52 ` ✓ Fi.CI.BAT: success for series starting with [1/2] " Patchwork
  2019-02-18 19:06 ` ✓ Fi.CI.IGT: " Patchwork
  3 siblings, 0 replies; 7+ messages in thread
From: Mika Kuoppala @ 2019-02-18 15:18 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> igt_ctx_sseu was caught using bannable contexts and in the course of
> resetting rapidly to run its test was banned. Don't let ourselves ban
> the test!
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/selftests/i915_gem_context.c  | 1 +
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c   | 2 ++
>  drivers/gpu/drm/i915/selftests/intel_workarounds.c | 2 ++
>  3 files changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/selftests/i915_gem_context.c
> index 7eb58a9d1319..b7b97c57ad05 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_context.c
> @@ -967,6 +967,7 @@ __igt_ctx_sseu(struct drm_i915_private *i915,
>  		ret = PTR_ERR(ctx);
>  		goto out_unlock;
>  	}
> +	i915_gem_context_clear_bannable(ctx); /* to reset and beyond! */
>  
>  	obj = i915_gem_object_create_internal(i915, PAGE_SIZE);
>  	if (IS_ERR(obj)) {
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 74e743b101d9..c32bc31192ae 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -56,6 +56,8 @@ static int hang_init(struct hang *h, struct drm_i915_private *i915)
>  	if (IS_ERR(h->ctx))
>  		return PTR_ERR(h->ctx);
>  
> +	GEM_BUG_ON(i915_gem_context_is_bannable(h->ctx));
> +
>  	h->hws = i915_gem_object_create_internal(i915, PAGE_SIZE);
>  	if (IS_ERR(h->hws)) {
>  		err = PTR_ERR(h->hws);
> diff --git a/drivers/gpu/drm/i915/selftests/intel_workarounds.c b/drivers/gpu/drm/i915/selftests/intel_workarounds.c
> index d6bb2005024d..fb479a2c04fb 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_workarounds.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_workarounds.c
> @@ -236,6 +236,8 @@ switch_to_scratch_context(struct intel_engine_cs *engine,
>  	if (IS_ERR(ctx))
>  		return PTR_ERR(ctx);
>  
> +	GEM_BUG_ON(i915_gem_context_is_bannable(ctx));
> +
>  	rq = ERR_PTR(-ENODEV);
>  	with_intel_runtime_pm(engine->i915, wakeref)
>  		rq = igt_spinner_create_request(spin, ctx, engine, MI_NOOP);
> -- 
> 2.20.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [1/2] drm/i915/selftests: Make unbannable contexts for reset handling
  2019-02-18 14:50 [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Chris Wilson
  2019-02-18 14:50 ` [PATCH 2/2] drm/i915: Use time based guilty context banning Chris Wilson
  2019-02-18 15:18 ` [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Mika Kuoppala
@ 2019-02-18 15:52 ` Patchwork
  2019-02-18 19:06 ` ✓ Fi.CI.IGT: " Patchwork
  3 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2019-02-18 15:52 UTC (permalink / raw)
  To: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Make unbannable contexts for reset handling
URL   : https://patchwork.freedesktop.org/series/56852/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_5626 -> Patchwork_12248
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/56852/revisions/1/mbox/

Known issues
------------

  Here are the changes found in Patchwork_12248 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-kbl-7567u:       PASS -> WARN [fdo#109380]

  * igt@pm_rpm@basic-rte:
    - fi-bsw-kefka:       PASS -> FAIL [fdo#108800]

  
#### Possible fixes ####

  * igt@kms_chamelium@hdmi-hpd-fast:
    - fi-kbl-7500u:       FAIL [fdo#109485] -> PASS

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
    - fi-blb-e6850:       INCOMPLETE [fdo#107718] -> PASS

  * igt@pm_rpm@module-reload:
    - {fi-icl-y}:         INCOMPLETE [fdo#108840] -> PASS

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#107718]: https://bugs.freedesktop.org/show_bug.cgi?id=107718
  [fdo#108569]: https://bugs.freedesktop.org/show_bug.cgi?id=108569
  [fdo#108800]: https://bugs.freedesktop.org/show_bug.cgi?id=108800
  [fdo#108840]: https://bugs.freedesktop.org/show_bug.cgi?id=108840
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#109380]: https://bugs.freedesktop.org/show_bug.cgi?id=109380
  [fdo#109485]: https://bugs.freedesktop.org/show_bug.cgi?id=109485
  [fdo#109567]: https://bugs.freedesktop.org/show_bug.cgi?id=109567


Participating hosts (47 -> 41)
------------------------------

  Additional (1): fi-bwr-2160 
  Missing    (7): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-gdg-551 fi-bdw-samus 


Build changes
-------------

    * Linux: CI_DRM_5626 -> Patchwork_12248

  CI_DRM_5626: 40f27a9645a0cae536df6bc26e0d91766b47eb2f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4835: 784b3c535cb066dafb05d52170851b9dfb262f98 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_12248: fd6f5f58096696285861e36875e9ce274493ccb8 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

fd6f5f580966 drm/i915: Use time based guilty context banning
ccebb9a1ea0e drm/i915/selftests: Make unbannable contexts for reset handling

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12248/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* ✓ Fi.CI.IGT: success for series starting with [1/2] drm/i915/selftests: Make unbannable contexts for reset handling
  2019-02-18 14:50 [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Chris Wilson
                   ` (2 preceding siblings ...)
  2019-02-18 15:52 ` ✓ Fi.CI.BAT: success for series starting with [1/2] " Patchwork
@ 2019-02-18 19:06 ` Patchwork
  3 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2019-02-18 19:06 UTC (permalink / raw)
  To: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Make unbannable contexts for reset handling
URL   : https://patchwork.freedesktop.org/series/56852/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_5626_full -> Patchwork_12248_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Known issues
------------

  Here are the changes found in Patchwork_12248_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_busy@extended-modeset-hang-newfb-render-c:
    - shard-iclb:         NOTRUN -> DMESG-WARN [fdo#107956]

  * igt@kms_ccs@pipe-c-crc-sprite-planes-basic:
    - shard-iclb:         NOTRUN -> FAIL [fdo#107725] +4

  * igt@kms_color@pipe-a-ctm-max:
    - shard-iclb:         NOTRUN -> FAIL [fdo#108147]

  * igt@kms_color@pipe-a-degamma:
    - shard-iclb:         NOTRUN -> FAIL [fdo#104782] +3

  * igt@kms_color@pipe-b-ctm-green-to-red:
    - shard-iclb:         NOTRUN -> DMESG-WARN [fdo#109624] +5

  * igt@kms_color@pipe-c-ctm-max:
    - shard-iclb:         NOTRUN -> DMESG-FAIL [fdo#109624] +1

  * igt@kms_cursor_crc@cursor-128x128-sliding:
    - shard-kbl:          NOTRUN -> FAIL [fdo#103232]

  * igt@kms_cursor_crc@cursor-256x256-random:
    - shard-iclb:         NOTRUN -> FAIL [fdo#103232] +9

  * igt@kms_cursor_crc@cursor-256x85-random:
    - shard-apl:          NOTRUN -> FAIL [fdo#103232] +1

  * igt@kms_cursor_crc@cursor-64x21-sliding:
    - shard-apl:          PASS -> FAIL [fdo#103232] +1

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-fullscreen:
    - shard-glk:          PASS -> FAIL [fdo#103167] +2

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-move:
    - shard-kbl:          NOTRUN -> FAIL [fdo#103167]

  * igt@kms_frontbuffer_tracking@psr-1p-primscrn-spr-indfb-draw-mmap-gtt:
    - shard-iclb:         NOTRUN -> FAIL [fdo#103167] +3

  * igt@kms_plane@pixel-format-pipe-a-planes:
    - shard-iclb:         NOTRUN -> FAIL [fdo#103166] +3

  * igt@kms_plane@pixel-format-pipe-b-planes-source-clamping:
    - shard-iclb:         NOTRUN -> FAIL [fdo#108948] +2

  * igt@kms_plane_alpha_blend@pipe-c-alpha-7efc:
    - shard-kbl:          NOTRUN -> FAIL [fdo#108145] / [fdo#108590]

  * igt@kms_plane_scaling@pipe-b-scaler-with-pixel-format:
    - shard-iclb:         NOTRUN -> DMESG-WARN [fdo#107724] +1

  * igt@kms_rotation_crc@multiplane-rotation-cropping-bottom:
    - shard-kbl:          PASS -> DMESG-FAIL [fdo#105763]

  * igt@kms_vblank@pipe-c-ts-continuation-suspend:
    - shard-apl:          PASS -> INCOMPLETE [fdo#103927]

  * igt@pm_backlight@fade_with_suspend:
    - shard-iclb:         NOTRUN -> FAIL [fdo#107847]

  * igt@pm_rps@reset:
    - shard-iclb:         NOTRUN -> FAIL [fdo#108344]

  
#### Possible fixes ####

  * igt@gem_ctx_isolation@vcs0-s3:
    - shard-kbl:          INCOMPLETE [fdo#103665] -> PASS

  * igt@kms_busy@extended-pageflip-hang-newfb-render-b:
    - shard-glk:          DMESG-WARN [fdo#107956] -> PASS

  * igt@kms_cursor_crc@cursor-256x256-suspend:
    - shard-apl:          FAIL [fdo#103191] / [fdo#103232] -> PASS

  * igt@kms_cursor_crc@cursor-64x21-random:
    - shard-apl:          FAIL [fdo#103232] -> PASS +7

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-draw-mmap-cpu:
    - shard-apl:          FAIL [fdo#103167] -> PASS

  * igt@kms_plane@pixel-format-pipe-b-planes-source-clamping:
    - shard-glk:          FAIL [fdo#108948] -> PASS

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-yf:
    - shard-apl:          FAIL [fdo#103166] -> PASS +4

  * igt@kms_rotation_crc@multiplane-rotation:
    - shard-kbl:          DMESG-FAIL [fdo#105763] -> PASS

  * igt@kms_setmode@basic:
    - shard-hsw:          FAIL [fdo#99912] -> PASS

  * igt@kms_universal_plane@universal-plane-pipe-c-functional:
    - shard-glk:          FAIL [fdo#103166] -> PASS +3

  * igt@kms_vblank@pipe-b-ts-continuation-modeset-hang:
    - shard-apl:          FAIL [fdo#104894] -> PASS

  * igt@pm_rc6_residency@rc6-accuracy:
    - shard-kbl:          {SKIP} [fdo#109271] -> PASS
    - shard-snb:          {SKIP} [fdo#109271] -> PASS

  * igt@pm_rpm@gem-idle:
    - shard-iclb:         DMESG-WARN [fdo#107724] -> PASS

  * igt@testdisplay:
    - shard-apl:          INCOMPLETE [fdo#103927] -> PASS

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#103166]: https://bugs.freedesktop.org/show_bug.cgi?id=103166
  [fdo#103167]: https://bugs.freedesktop.org/show_bug.cgi?id=103167
  [fdo#103191]: https://bugs.freedesktop.org/show_bug.cgi?id=103191
  [fdo#103232]: https://bugs.freedesktop.org/show_bug.cgi?id=103232
  [fdo#103665]: https://bugs.freedesktop.org/show_bug.cgi?id=103665
  [fdo#103927]: https://bugs.freedesktop.org/show_bug.cgi?id=103927
  [fdo#104782]: https://bugs.freedesktop.org/show_bug.cgi?id=104782
  [fdo#104894]: https://bugs.freedesktop.org/show_bug.cgi?id=104894
  [fdo#105763]: https://bugs.freedesktop.org/show_bug.cgi?id=105763
  [fdo#107724]: https://bugs.freedesktop.org/show_bug.cgi?id=107724
  [fdo#107725]: https://bugs.freedesktop.org/show_bug.cgi?id=107725
  [fdo#107847]: https://bugs.freedesktop.org/show_bug.cgi?id=107847
  [fdo#107956]: https://bugs.freedesktop.org/show_bug.cgi?id=107956
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#108147]: https://bugs.freedesktop.org/show_bug.cgi?id=108147
  [fdo#108344]: https://bugs.freedesktop.org/show_bug.cgi?id=108344
  [fdo#108590]: https://bugs.freedesktop.org/show_bug.cgi?id=108590
  [fdo#108948]: https://bugs.freedesktop.org/show_bug.cgi?id=108948
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109276]: https://bugs.freedesktop.org/show_bug.cgi?id=109276
  [fdo#109277]: https://bugs.freedesktop.org/show_bug.cgi?id=109277
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109281]: https://bugs.freedesktop.org/show_bug.cgi?id=109281
  [fdo#109283]: https://bugs.freedesktop.org/show_bug.cgi?id=109283
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109287]: https://bugs.freedesktop.org/show_bug.cgi?id=109287
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109290]: https://bugs.freedesktop.org/show_bug.cgi?id=109290
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109292]: https://bugs.freedesktop.org/show_bug.cgi?id=109292
  [fdo#109293]: https://bugs.freedesktop.org/show_bug.cgi?id=109293
  [fdo#109294]: https://bugs.freedesktop.org/show_bug.cgi?id=109294
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#109300]: https://bugs.freedesktop.org/show_bug.cgi?id=109300
  [fdo#109301]: https://bugs.freedesktop.org/show_bug.cgi?id=109301
  [fdo#109308]: https://bugs.freedesktop.org/show_bug.cgi?id=109308
  [fdo#109313]: https://bugs.freedesktop.org/show_bug.cgi?id=109313
  [fdo#109314]: https://bugs.freedesktop.org/show_bug.cgi?id=109314
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#109502]: https://bugs.freedesktop.org/show_bug.cgi?id=109502
  [fdo#109503]: https://bugs.freedesktop.org/show_bug.cgi?id=109503
  [fdo#109527]: https://bugs.freedesktop.org/show_bug.cgi?id=109527
  [fdo#109624]: https://bugs.freedesktop.org/show_bug.cgi?id=109624
  [fdo#99912]: https://bugs.freedesktop.org/show_bug.cgi?id=99912


Participating hosts (7 -> 6)
------------------------------

  Missing    (1): shard-skl 


Build changes
-------------

    * Linux: CI_DRM_5626 -> Patchwork_12248

  CI_DRM_5626: 40f27a9645a0cae536df6bc26e0d91766b47eb2f @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4835: 784b3c535cb066dafb05d52170851b9dfb262f98 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_12248: fd6f5f58096696285861e36875e9ce274493ccb8 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12248/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/i915: Use time based guilty context banning
  2019-02-18 14:50 ` [PATCH 2/2] drm/i915: Use time based guilty context banning Chris Wilson
@ 2019-02-19 11:22   ` Mika Kuoppala
  2019-02-19 11:36     ` Chris Wilson
  0 siblings, 1 reply; 7+ messages in thread
From: Mika Kuoppala @ 2019-02-19 11:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Currently, we accumulate each time a context hangs the GPU, offset
> against the number of requests it submits, and if that score exceeds a
> certain threshold, we ban that context from submitting any more requests
> (cancelling any work in flight). In contrast, we use a simple timer on
> the file, that if we see more than a 9 hangs faster than 60s apart in
> total across all of its contexts, we will ban the client from creating
> any more contexts. This leads to a confusing situation where the file
> may be banned before the context, so lets use a simple timer scheme for
> each.
>
> If the context submits 3 hanging requests within a 120s period, declare
> it forbidden to ever send more requests.
>
> This has the advantage of not being easy to repair by simply sending
> empty requests, but has the disadvantage that if the context is idle
> then it is forgiven. However, if the context is idle, it is not
> disrupting the system, but a hog can evade the request counting and
> cause much more severe disruption to the system.
>
> Updating ban_score from request retirement is dubious as the retirement
> is purposely not in sync with request submission (i.e. we try and batch
> retirement to reduce overhead and avoid latency on submission), which
> leads to surprising situations where we can forgive a hang immediately
> due to a backlog of requests from before the hang being retired
> afterwards.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c |  4 ++++
>  drivers/gpu/drm/i915/i915_gem_context.h |  9 +++----
>  drivers/gpu/drm/i915/i915_gpu_error.c   | 31 +++++++------------------
>  drivers/gpu/drm/i915/i915_gpu_error.h   |  3 ---
>  drivers/gpu/drm/i915/i915_request.c     |  2 --
>  drivers/gpu/drm/i915/i915_reset.c       | 27 ++++++++++++---------
>  6 files changed, 33 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index da21c843fed8..7541c6f961b3 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -355,6 +355,7 @@ __create_hw_context(struct drm_i915_private *dev_priv,
>  	struct i915_gem_context *ctx;
>  	unsigned int n;
>  	int ret;
> +	int i;
>  
>  	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>  	if (ctx == NULL)
> @@ -407,6 +408,9 @@ __create_hw_context(struct drm_i915_private *dev_priv,
>  	ctx->desc_template =
>  		default_desc_template(dev_priv, dev_priv->mm.aliasing_ppgtt);
>  
> +	for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
> +		ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
> +
>  	return ctx;
>  
>  err_pid:
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 071108d34ae0..dc6c58f38cfa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -209,10 +209,11 @@ struct i915_gem_context {
>  	 */
>  	atomic_t active_count;
>  
> -#define CONTEXT_SCORE_GUILTY		10
> -#define CONTEXT_SCORE_BAN_THRESHOLD	40
> -	/** ban_score: Accumulated score of all hangs caused by this context. */
> -	atomic_t ban_score;
> +	/**
> +	 * @hang_timestamp: The last time(s) this context caused a GPU hang
> +	 */
> +	unsigned long hang_timestamp[2];
> +#define CONTEXT_FAST_HANG_JIFFIES (120 * HZ) /* 3 hangs within 120s? Banned! */
>  
>  	/** remap_slice: Bitmask of cache lines that need remapping */
>  	u8 remap_slice;
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9a65341fec09..3f6eddb6f6de 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -434,11 +434,6 @@ static void error_print_instdone(struct drm_i915_error_state_buf *m,
>  			   ee->instdone.row[slice][subslice]);
>  }
>  
> -static const char *bannable(const struct drm_i915_error_context *ctx)
> -{
> -	return ctx->bannable ? "" : " (unbannable)";
> -}
> -
>  static void error_print_request(struct drm_i915_error_state_buf *m,
>  				const char *prefix,
>  				const struct drm_i915_error_request *erq,
> @@ -447,9 +442,8 @@ static void error_print_request(struct drm_i915_error_state_buf *m,
>  	if (!erq->seqno)
>  		return;
>  
> -	err_printf(m, "%s pid %d, ban score %d, seqno %8x:%08x%s%s, prio %d, emitted %dms, start %08x, head %08x, tail %08x\n",
> -		   prefix, erq->pid, erq->ban_score,
> -		   erq->context, erq->seqno,
> +	err_printf(m, "%s pid %d, seqno %8x:%08x%s%s, prio %d, emitted %dms, start %08x, head %08x, tail %08x\n",
> +		   prefix, erq->pid, erq->context, erq->seqno,
>  		   test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
>  			    &erq->flags) ? "!" : "",
>  		   test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
> @@ -463,10 +457,9 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
>  				const char *header,
>  				const struct drm_i915_error_context *ctx)
>  {
> -	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d\n",
> +	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, guilty %d active %d\n",
>  		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
> -		   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
> -		   ctx->guilty, ctx->active);
> +		   ctx->sched_attr.priority, ctx->guilty, ctx->active);
>  }
>  
>  static void error_print_engine(struct drm_i915_error_state_buf *m,
> @@ -688,12 +681,10 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m,
>  		if (!error->engine[i].context.pid)
>  			continue;
>  
> -		err_printf(m, "Active process (on ring %s): %s [%d], score %d%s\n",
> +		err_printf(m, "Active process (on ring %s): %s [%d]\n",
>  			   engine_name(m->i915, i),
>  			   error->engine[i].context.comm,
> -			   error->engine[i].context.pid,
> -			   error->engine[i].context.ban_score,
> -			   bannable(&error->engine[i].context));
> +			   error->engine[i].context.pid);
>  	}
>  	err_printf(m, "Reset count: %u\n", error->reset_count);
>  	err_printf(m, "Suspend count: %u\n", error->suspend_count);
> @@ -779,13 +770,11 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m,
>  		if (obj) {
>  			err_puts(m, m->i915->engine[i]->name);
>  			if (ee->context.pid)
> -				err_printf(m, " (submitted by %s [%d], ctx %d [%d], score %d%s)",
> +				err_printf(m, " (submitted by %s [%d], ctx %d [%d])",
>  					   ee->context.comm,
>  					   ee->context.pid,
>  					   ee->context.handle,
> -					   ee->context.hw_id,
> -					   ee->context.ban_score,
> -					   bannable(&ee->context));
> +					   ee->context.hw_id);
>  			err_printf(m, " --- gtt_offset = 0x%08x %08x\n",
>  				   upper_32_bits(obj->gtt_offset),
>  				   lower_32_bits(obj->gtt_offset));
> @@ -1301,8 +1290,6 @@ static void record_request(struct i915_request *request,
>  	erq->flags = request->fence.flags;
>  	erq->context = ctx->hw_id;
>  	erq->sched_attr = request->sched.attr;
> -	erq->ban_score = atomic_read(&ctx->ban_score);
> -	erq->seqno = request->global_seqno;
>  	erq->jiffies = request->emitted_jiffies;
>  	erq->start = i915_ggtt_offset(request->ring->vma);
>  	erq->head = request->head;
> @@ -1396,8 +1383,6 @@ static void record_context(struct drm_i915_error_context *e,
>  	e->handle = ctx->user_handle;
>  	e->hw_id = ctx->hw_id;
>  	e->sched_attr = ctx->sched;
> -	e->ban_score = atomic_read(&ctx->ban_score);
> -	e->bannable = i915_gem_context_is_bannable(ctx);
>  	e->guilty = atomic_read(&ctx->guilty_count);
>  	e->active = atomic_read(&ctx->active_count);
>  }
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index afa3adb28f02..94eaf8ab9051 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -122,10 +122,8 @@ struct i915_gpu_state {
>  			pid_t pid;
>  			u32 handle;
>  			u32 hw_id;
> -			int ban_score;
>  			int active;
>  			int guilty;
> -			bool bannable;
>  			struct i915_sched_attr sched_attr;
>  		} context;
>  
> @@ -149,7 +147,6 @@ struct i915_gpu_state {
>  			long jiffies;
>  			pid_t pid;
>  			u32 context;
> -			int ban_score;
>  			u32 seqno;
>  			u32 start;
>  			u32 head;
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5ab4e1c01618..a61e3a4fc9dc 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -290,8 +290,6 @@ static void i915_request_retire(struct i915_request *request)
>  
>  	i915_request_remove_from_client(request);
>  
> -	/* Retirement decays the ban score as it is a sign of ctx progress */
> -	atomic_dec_if_positive(&request->gem_context->ban_score);
>  	intel_context_unpin(request->hw_context);
>  
>  	__retire_engine_upto(request->engine, request);
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 1911e00d2581..bae88a4ea924 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -59,24 +59,29 @@ static void client_mark_guilty(struct drm_i915_file_private *file_priv,
>  
>  static bool context_mark_guilty(struct i915_gem_context *ctx)
>  {
> -	unsigned int score;
> -	bool banned, bannable;
> +	unsigned long prev_hang;
> +	bool banned;
> +	int i;
>  
>  	atomic_inc(&ctx->guilty_count);
>  
> -	bannable = i915_gem_context_is_bannable(ctx);
> -	score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> -	banned = (!i915_gem_context_is_recoverable(ctx) ||
> -		  score >= CONTEXT_SCORE_BAN_THRESHOLD);
> -
>  	/* Cool contexts don't accumulate client ban score */

This comment becomes misleading and can be removed.

> -	if (!bannable)
> +	if (!i915_gem_context_is_bannable(ctx))
>  		return false;
>  
> +	/* Record the timestamp for the last N hangs */
> +	prev_hang = ctx->hang_timestamp[0];
> +	for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp) - 1; i++)
> +		ctx->hang_timestamp[i] = ctx->hang_timestamp[i + 1];
> +	ctx->hang_timestamp[i] = jiffies;
> +
> +	/* If we have hung N+1 times in rapid succession, we ban the context! */
> +	banned = !i915_gem_context_is_recoverable(ctx);
> +	if (time_before(jiffies, prev_hang + CONTEXT_FAST_HANG_JIFFIES))
> +		banned = true;

Ok, the initialization to jiffies - CONTEXT_FAST_HANG_JIFFIES guarantees
that it cant be banned even if it manages to immediately hang. Which
in itself is difficult feat to accomplish.

>  	if (banned) {
> -		DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, banned\n",
> -				 ctx->name, atomic_read(&ctx->guilty_count),
> -				 score);
> +		DRM_DEBUG_DRIVER("context %s: guilty %d, banned\n",
> +				 ctx->name,
> atomic_read(&ctx->guilty_count));

Now when we know when it previously hanged, we could improve the
logging a bit by saying how long ago. But not sure
about if it worth the trouble.

With the misleading comment removed/corrected,
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>


>  		i915_gem_context_set_banned(ctx);
>  	}
>  
> -- 
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/i915: Use time based guilty context banning
  2019-02-19 11:22   ` Mika Kuoppala
@ 2019-02-19 11:36     ` Chris Wilson
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Wilson @ 2019-02-19 11:36 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2019-02-19 11:22:32)
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> > index 1911e00d2581..bae88a4ea924 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -59,24 +59,29 @@ static void client_mark_guilty(struct drm_i915_file_private *file_priv,
> >  
> >  static bool context_mark_guilty(struct i915_gem_context *ctx)
> >  {
> > -     unsigned int score;
> > -     bool banned, bannable;
> > +     unsigned long prev_hang;
> > +     bool banned;
> > +     int i;
> >  
> >       atomic_inc(&ctx->guilty_count);
> >  
> > -     bannable = i915_gem_context_is_bannable(ctx);
> > -     score = atomic_add_return(CONTEXT_SCORE_GUILTY, &ctx->ban_score);
> > -     banned = (!i915_gem_context_is_recoverable(ctx) ||
> > -               score >= CONTEXT_SCORE_BAN_THRESHOLD);
> > -
> >       /* Cool contexts don't accumulate client ban score */
> 
> This comment becomes misleading and can be removed.

/* Cool contexts are too cool to be banned */

> 
> > -     if (!bannable)
> > +     if (!i915_gem_context_is_bannable(ctx))
> >               return false;
> >  
> > +     /* Record the timestamp for the last N hangs */
> > +     prev_hang = ctx->hang_timestamp[0];
> > +     for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp) - 1; i++)
> > +             ctx->hang_timestamp[i] = ctx->hang_timestamp[i + 1];
> > +     ctx->hang_timestamp[i] = jiffies;
> > +
> > +     /* If we have hung N+1 times in rapid succession, we ban the context! */
> > +     banned = !i915_gem_context_is_recoverable(ctx);
> > +     if (time_before(jiffies, prev_hang + CONTEXT_FAST_HANG_JIFFIES))
> > +             banned = true;
> 
> Ok, the initialization to jiffies - CONTEXT_FAST_HANG_JIFFIES guarantees
> that it cant be banned even if it manages to immediately hang. Which
> in itself is difficult feat to accomplish.
> 
> >       if (banned) {
> > -             DRM_DEBUG_DRIVER("context %s: guilty %d, score %u, banned\n",
> > -                              ctx->name, atomic_read(&ctx->guilty_count),
> > -                              score);
> > +             DRM_DEBUG_DRIVER("context %s: guilty %d, banned\n",
> > +                              ctx->name,
> > atomic_read(&ctx->guilty_count));
> 
> Now when we know when it previously hanged, we could improve the
> logging a bit by saying how long ago. But not sure
> about if it worth the trouble.

~o~

Not sure yet if that will be helpful. The secondary hangs do not get
captured in error-state, so for post-mortem debugging we don't know if
the context is banned (yet). For clients trying to run, the first they
know about being banned is -EIO from gem_execbuf. Looking back at dmesg,
there's just a string of GPU resets, already too late for adding
drm.debug=0x2 to dmesg.

-EIO either means client banned or driver is wedged. In either case, the
first course of action is to debug the initial hang. (Or in the case of
mesa recovery patches, to recreate the context and try again.) From the
perspective of the current patches, we are teaching clients how to
handle such events without "distrubing" the user, even less incentive
for large amount of debugging into the secondary banning, fix the
primary hang!

Yeah, I'm thinking that we don't need much more than some driver
debugging to clarify that the context was banned, everything else should
be discoverable though the uapi; and the real debugging is always the
first hang.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-02-19 11:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-18 14:50 [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Chris Wilson
2019-02-18 14:50 ` [PATCH 2/2] drm/i915: Use time based guilty context banning Chris Wilson
2019-02-19 11:22   ` Mika Kuoppala
2019-02-19 11:36     ` Chris Wilson
2019-02-18 15:18 ` [PATCH 1/2] drm/i915/selftests: Make unbannable contexts for reset handling Mika Kuoppala
2019-02-18 15:52 ` ✓ Fi.CI.BAT: success for series starting with [1/2] " Patchwork
2019-02-18 19:06 ` ✓ Fi.CI.IGT: " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.