All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq()
@ 2018-06-04  7:34 Chris Wilson
  2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-04  7:34 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

This looked to be a sensible precaution to flush an ongoing interrupt
before resetting the IRQ. However, it has a potential to trigger a lockup
and is not strictly required so drop it.

[21411.603173] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[21411.603174] Modules linked in: i915 intel_gtt drm_kms_helper
[21411.603181] CPU: 1 PID: 672 Comm: kworker/1:2 Tainted: G     U            4.17.0-rc7+ #307
[21411.603181] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
[21411.603227] Workqueue: events_long i915_hangcheck_elapsed [i915]
[21411.603234] RIP: 0010:__synchronize_hardirq+0x12/0x50
[21411.603235] RSP: 0018:ffffc90000393c48 EFLAGS: 00000006
[21411.603236] RAX: 0000000001440200 RBX: ffff880275b09200 RCX: 000000000000003d
[21411.603237] RDX: 0000000000000000 RSI: 000000000000007d RDI: ffff880275b09200
[21411.603238] RBP: ffff880256e81440 R08: ffff88027647a910 R09: ffff880275b09200
[21411.603239] R10: 0000000000000000 R11: 0000000000000040 R12: ffff880275b092a4
[21411.603239] R13: 0000000000000286 R14: ffff880273a86000 R15: ffff88027004eac0
[21411.603241] FS:  0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
[21411.603242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21411.603243] CR2: 00007f9cc24d6018 CR3: 0000000002a09000 CR4: 00000000001406e0
[21411.603243] Call Trace:
[21411.603250]  synchronize_hardirq+0x1b/0x30
[21411.603286]  reset_irq+0x1f/0x180 [i915]
[21411.603323]  execlists_reset+0x2f/0x1d0 [i915]
[21411.603351]  i915_reset_engine+0xf0/0x140 [i915]
[21411.603381]  i915_handle_error+0x11d/0x420 [i915]
[21411.603386]  ? scnprintf+0x3a/0x70
[21411.603421]  i915_hangcheck_elapsed+0x310/0x490 [i915]
[21411.603425]  process_one_work+0x195/0x310
[21411.603427]  worker_thread+0x26/0x3c0
[21411.603429]  ? wq_nice_store+0xd0/0xd0
[21411.603431]  kthread+0x107/0x120
[21411.603433]  ? _kthread_create_worker_on_cpu+0x40/0x40
[21411.603435]  ret_from_fork+0x1f/0x30
[21411.603437] Code: 00 00 89 c5 e8 50 e1 41 00 48 8b 53 38 89 e8 81 22 ff ff fb ff 5b 5d c3 90 41 54 4c 8d a7 a4 00 00 00 55 53 48 89 fb eb 02 f3 90 <48> 8b 43 38 8b 00 a9 00 00 04 00 75 f1 4c 89 e7 e8 89 e0 41 00

Fixes: 46b3617dfec8 ("drm/i915: Actually flush interrupts on reset not just wedging")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 517e92c6a70b..8d912d0c8fc1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -869,7 +869,6 @@ static void reset_irq(struct intel_engine_cs *engine)
 {
 	/* Mark all CS interrupts as complete */
 	smp_store_mb(engine->execlists.active, 0);
-	synchronize_hardirq(engine->i915->drm.irq);
 
 	clear_gtiir(engine);
 
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
@ 2018-06-04  7:34 ` Chris Wilson
  2018-06-04 15:17   ` Tvrtko Ursulin
  2018-06-14 15:48   ` Mika Kuoppala
  2018-06-04  7:34 ` [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-04  7:34 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

In the unlikely case where we have failed to keep submitting to the GPU,
we end up with the ELSP queue empty but a pending queue of requests.
Here, we skip the per-engine reset as there is no guilty request, but in
doing so we also skip the engine restart leaving ourselves with a
permanently hung engine. A quick way to recover is by moving the tasklet
kick to execlists_reset_finish() (from init_hw). We still emit the error
on hanging, so the error is not lost but we should be able to recover.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8d912d0c8fc1..c8d9b5aed94a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
 
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
-	struct intel_engine_execlists * const execlists = &engine->execlists;
 	int ret;
 
 	ret = intel_mocs_init_engine(engine);
@@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	enable_execlists(engine);
 
-	/* After a GPU reset, we may have requests to replay */
-	if (execlists->first)
-		tasklet_schedule(&execlists->tasklet);
-
 	return 0;
 }
 
@@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
 
 static void execlists_reset_finish(struct intel_engine_cs *engine)
 {
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+
+	/* After a GPU reset, we may have requests to replay */
+	if (execlists->first)
+		tasklet_schedule(&execlists->tasklet);
+
 	/*
 	 * Flush the tasklet while we still have the forcewake to be sure
 	 * that it is not allowed to sleep before we restart and reload a
@@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
 	 * serialising multiple attempts to reset so that we know that we
 	 * are the only one manipulating tasklet state.
 	 */
-	__tasklet_enable_sync_once(&engine->execlists.tasklet);
+	__tasklet_enable_sync_once(&execlists->tasklet);
 
 	GEM_TRACE("%s\n", engine->name);
 }
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
  2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
@ 2018-06-04  7:34 ` Chris Wilson
  2018-06-05  9:26   ` Mika Kuoppala
  2018-06-04  7:57 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq() Patchwork
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Chris Wilson @ 2018-06-04  7:34 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

Hangcheck is our back up in case the GPU or the driver gets stuck. It
detects when the GPU is not making any progress and issues a GPU reset.
However, if the driver is failing to make any progress, we can get
ourselves into a situation where we continually try resetting the GPU to
no avail. Employ a second timeout such that if we continue to see the
same seqno (the stalled engine has made no progress at all) over the
course of several hangchecks, declare the driver wedged and attempt to
start afresh.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
 drivers/gpu/drm/i915/i915_drv.h         |  2 ++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 15e86d34a81c..2eae3abb0ec1 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1359,11 +1359,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
 			   engine->hangcheck.seqno, seqno[id],
 			   intel_engine_last_submit(engine));
-		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n",
+		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
 			   yesno(intel_engine_has_waiter(engine)),
 			   yesno(test_bit(engine->id,
 					  &dev_priv->gpu_error.missed_irq_rings)),
-			   yesno(engine->hangcheck.stalled));
+			   yesno(engine->hangcheck.stalled),
+			   yesno(engine->hangcheck.wedged));
 
 		spin_lock_irq(&b->rb_lock);
 		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 38157df6ff5c..a4ed0baeb0ed 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -994,6 +994,8 @@ struct i915_gem_mm {
 #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
 #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
 
+#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
+
 enum modeset_restore {
 	MODESET_ON_LID_OPEN,
 	MODESET_DONE,
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index d47e346bd49e..2fc7a0dd0df9 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
 	engine->hangcheck.seqno = hc->seqno;
 	engine->hangcheck.action = hc->action;
 	engine->hangcheck.stalled = hc->stalled;
+	engine->hangcheck.wedged = hc->wedged;
 }
 
 static enum intel_engine_hangcheck_action
@@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 	hc->stalled = time_after(jiffies,
 				 engine->hangcheck.action_timestamp + timeout);
+	hc->wedged = time_after(jiffies,
+				 engine->hangcheck.action_timestamp +
+				 I915_ENGINE_WEDGED_TIMEOUT);
 }
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
@@ -409,7 +413,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, wedged = 0;
 
 	if (!i915_modparams.enable_hangcheck)
 		return;
@@ -440,6 +444,17 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			if (hc.action != ENGINE_DEAD)
 				stuck |= intel_engine_flag(engine);
 		}
+
+		if (engine->hangcheck.wedged)
+			wedged |= intel_engine_flag(engine);
+	}
+
+	if (wedged) {
+		dev_err(dev_priv->drm.dev,
+			"GPU recovery timed out,"
+			" cancelling all in-flight rendering.\n");
+		GEM_TRACE_DUMP();
+		i915_gem_set_wedged(dev_priv);
 	}
 
 	if (hung)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index acef385c4c80..d9e8d79c56e4 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -122,7 +122,8 @@ struct intel_engine_hangcheck {
 	int deadlock;
 	struct intel_instdone instdone;
 	struct i915_request *active_request;
-	bool stalled;
+	bool stalled:1;
+	bool wedged:1;
 };
 
 struct intel_ring {
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
  2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
  2018-06-04  7:34 ` [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
@ 2018-06-04  7:57 ` Patchwork
  2018-06-04  7:58 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2018-06-04  7:57 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
URL   : https://patchwork.freedesktop.org/series/44169/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
02de5fbf1d99 drm/i915/execlists: Remove synchronize_hardirq()
70e0a3c35bb8 drm/i915/execlists: Push the tasklet kick after reset to reset_finish
f2721aee38f2 drm/i915: Declare the driver wedged if hangcheck makes no progress
-:68: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#68: FILE: drivers/gpu/drm/i915/intel_hangcheck.c:373:
+	hc->wedged = time_after(jiffies,
+				 engine->hangcheck.action_timestamp +

-:109: WARNING:BOOL_BITFIELD: Avoid using bool as bitfield.  Prefer bool bitfields as unsigned int or u<8|16|32>
#109: FILE: drivers/gpu/drm/i915/intel_ringbuffer.h:125:
+	bool stalled:1;

-:110: WARNING:BOOL_BITFIELD: Avoid using bool as bitfield.  Prefer bool bitfields as unsigned int or u<8|16|32>
#110: FILE: drivers/gpu/drm/i915/intel_ringbuffer.h:126:
+	bool wedged:1;

total: 0 errors, 2 warnings, 1 checks, 72 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✗ Fi.CI.SPARSE: warning for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
                   ` (2 preceding siblings ...)
  2018-06-04  7:57 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq() Patchwork
@ 2018-06-04  7:58 ` Patchwork
  2018-06-04  8:18 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2018-06-04  7:58 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
URL   : https://patchwork.freedesktop.org/series/44169/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Commit: drm/i915/execlists: Remove synchronize_hardirq()
Okay!

Commit: drm/i915/execlists: Push the tasklet kick after reset to reset_finish
Okay!

Commit: drm/i915: Declare the driver wedged if hangcheck makes no progress
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3665:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3667:16: warning: expression using sizeof(void)

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
                   ` (3 preceding siblings ...)
  2018-06-04  7:58 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-06-04  8:18 ` Patchwork
  2018-06-04  9:09 ` ✓ Fi.CI.IGT: " Patchwork
  2018-06-04 15:13 ` [PATCH 1/3] " Tvrtko Ursulin
  6 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2018-06-04  8:18 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
URL   : https://patchwork.freedesktop.org/series/44169/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4275 -> Patchwork_9182 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/44169/revisions/1/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9182 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_exec_suspend@basic-s3:
      fi-skl-gvtdvm:      NOTRUN -> INCOMPLETE (fdo#105600, fdo#104108)

    igt@kms_flip@basic-flip-vs-dpms:
      fi-glk-j4005:       PASS -> DMESG-WARN (fdo#106000)

    
    ==== Possible fixes ====

    igt@core_auth@basic-auth:
      fi-bdw-gvtdvm:      DMESG-WARN (fdo#105600) -> PASS +2

    igt@gem_mmap_gtt@basic-small-bo-tiledx:
      fi-gdg-551:         FAIL (fdo#102575) -> PASS

    igt@kms_flip@basic-flip-vs-modeset:
      fi-glk-j4005:       DMESG-WARN (fdo#106000) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS
      fi-cnl-psr:         DMESG-WARN (fdo#104951) -> PASS

    
  fdo#102575 https://bugs.freedesktop.org/show_bug.cgi?id=102575
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104108 https://bugs.freedesktop.org/show_bug.cgi?id=104108
  fdo#104951 https://bugs.freedesktop.org/show_bug.cgi?id=104951
  fdo#105600 https://bugs.freedesktop.org/show_bug.cgi?id=105600
  fdo#106000 https://bugs.freedesktop.org/show_bug.cgi?id=106000


== Participating hosts (40 -> 38) ==

  Additional (3): fi-hsw-peppy fi-skl-guc fi-skl-gvtdvm 
  Missing    (5): fi-ctg-p8600 fi-byt-squawks fi-ilk-m540 fi-cnl-y3 fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4275 -> Patchwork_9182

  CI_DRM_4275: 8fdb62e0511e81fa935059c274a2457361fdb679 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4505: 8a8f0271a71e2e0d2a2caa4d41f4ad1d9c89670e @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9182: f2721aee38f22dada277bfa98661db7aac45b51b @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

f2721aee38f2 drm/i915: Declare the driver wedged if hangcheck makes no progress
70e0a3c35bb8 drm/i915/execlists: Push the tasklet kick after reset to reset_finish
02de5fbf1d99 drm/i915/execlists: Remove synchronize_hardirq()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9182/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* ✓ Fi.CI.IGT: success for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
                   ` (4 preceding siblings ...)
  2018-06-04  8:18 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-06-04  9:09 ` Patchwork
  2018-06-04 15:13 ` [PATCH 1/3] " Tvrtko Ursulin
  6 siblings, 0 replies; 15+ messages in thread
From: Patchwork @ 2018-06-04  9:09 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq()
URL   : https://patchwork.freedesktop.org/series/44169/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4275_full -> Patchwork_9182_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9182_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9182_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/44169/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9182_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_schedule@deep-blt:
      shard-kbl:          PASS -> SKIP

    igt@pm_rc6_residency@rc6-accuracy:
      shard-kbl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9182_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic:
      shard-glk:          PASS -> FAIL (fdo#106509, fdo#105454)

    igt@kms_flip@dpms-vs-vblank-race:
      shard-glk:          PASS -> FAIL (fdo#103060)

    igt@kms_flip@modeset-vs-vblank-race-interruptible:
      shard-hsw:          PASS -> FAIL (fdo#103060)

    igt@kms_flip_tiling@flip-x-tiled:
      shard-glk:          PASS -> FAIL (fdo#104724, fdo#103822)

    
    ==== Possible fixes ====

    igt@drv_selftest@live_gtt:
      shard-glk:          INCOMPLETE (k.org#198133, fdo#103359) -> PASS

    igt@gem_eio@hibernate:
      shard-snb:          INCOMPLETE (fdo#105411) -> PASS

    igt@kms_flip@2x-dpms-vs-vblank-race:
      shard-glk:          FAIL (fdo#103060) -> PASS

    igt@kms_flip@basic-flip-vs-wf_vblank:
      shard-hsw:          FAIL (fdo#103928) -> PASS

    igt@kms_flip@plain-flip-fb-recreate-interruptible:
      shard-glk:          FAIL (fdo#100368) -> PASS +1

    igt@kms_flip_tiling@flip-y-tiled:
      shard-glk:          FAIL (fdo#104724, fdo#103822) -> PASS

    
    ==== Warnings ====

    igt@gem_eio@suspend:
      shard-snb:          DMESG-FAIL (fdo#106808) -> INCOMPLETE (fdo#105411)

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105411 https://bugs.freedesktop.org/show_bug.cgi?id=105411
  fdo#105454 https://bugs.freedesktop.org/show_bug.cgi?id=105454
  fdo#106509 https://bugs.freedesktop.org/show_bug.cgi?id=106509
  fdo#106808 https://bugs.freedesktop.org/show_bug.cgi?id=106808
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4275 -> Patchwork_9182

  CI_DRM_4275: 8fdb62e0511e81fa935059c274a2457361fdb679 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4505: 8a8f0271a71e2e0d2a2caa4d41f4ad1d9c89670e @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9182: f2721aee38f22dada277bfa98661db7aac45b51b @ git://anongit.freedesktop.org/gfx-ci/linux

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9182/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
                   ` (5 preceding siblings ...)
  2018-06-04  9:09 ` ✓ Fi.CI.IGT: " Patchwork
@ 2018-06-04 15:13 ` Tvrtko Ursulin
  2018-06-04 15:21   ` Chris Wilson
  6 siblings, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2018-06-04 15:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: mika.kuoppala


On 04/06/2018 08:34, Chris Wilson wrote:
> This looked to be a sensible precaution to flush an ongoing interrupt
> before resetting the IRQ. However, it has a potential to trigger a lockup
> and is not strictly required so drop it.
> 
> [21411.603173] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> [21411.603174] Modules linked in: i915 intel_gtt drm_kms_helper
> [21411.603181] CPU: 1 PID: 672 Comm: kworker/1:2 Tainted: G     U            4.17.0-rc7+ #307
> [21411.603181] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
> [21411.603227] Workqueue: events_long i915_hangcheck_elapsed [i915]
> [21411.603234] RIP: 0010:__synchronize_hardirq+0x12/0x50
> [21411.603235] RSP: 0018:ffffc90000393c48 EFLAGS: 00000006
> [21411.603236] RAX: 0000000001440200 RBX: ffff880275b09200 RCX: 000000000000003d
> [21411.603237] RDX: 0000000000000000 RSI: 000000000000007d RDI: ffff880275b09200
> [21411.603238] RBP: ffff880256e81440 R08: ffff88027647a910 R09: ffff880275b09200
> [21411.603239] R10: 0000000000000000 R11: 0000000000000040 R12: ffff880275b092a4
> [21411.603239] R13: 0000000000000286 R14: ffff880273a86000 R15: ffff88027004eac0
> [21411.603241] FS:  0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
> [21411.603242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [21411.603243] CR2: 00007f9cc24d6018 CR3: 0000000002a09000 CR4: 00000000001406e0
> [21411.603243] Call Trace:
> [21411.603250]  synchronize_hardirq+0x1b/0x30
> [21411.603286]  reset_irq+0x1f/0x180 [i915]
> [21411.603323]  execlists_reset+0x2f/0x1d0 [i915]
> [21411.603351]  i915_reset_engine+0xf0/0x140 [i915]
> [21411.603381]  i915_handle_error+0x11d/0x420 [i915]
> [21411.603386]  ? scnprintf+0x3a/0x70
> [21411.603421]  i915_hangcheck_elapsed+0x310/0x490 [i915]
> [21411.603425]  process_one_work+0x195/0x310
> [21411.603427]  worker_thread+0x26/0x3c0
> [21411.603429]  ? wq_nice_store+0xd0/0xd0
> [21411.603431]  kthread+0x107/0x120
> [21411.603433]  ? _kthread_create_worker_on_cpu+0x40/0x40
> [21411.603435]  ret_from_fork+0x1f/0x30
> [21411.603437] Code: 00 00 89 c5 e8 50 e1 41 00 48 8b 53 38 89 e8 81 22 ff ff fb ff 5b 5d c3 90 41 54 4c 8d a7 a4 00 00 00 55 53 48 89 fb eb 02 f3 90 <48> 8b 43 38 8b 00 a9 00 00 04 00 75 f1 4c 89 e7 e8 89 e0 41 00
> 
> Fixes: 46b3617dfec8 ("drm/i915: Actually flush interrupts on reset not just wedging")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 517e92c6a70b..8d912d0c8fc1 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -869,7 +869,6 @@ static void reset_irq(struct intel_engine_cs *engine)
>   {
>   	/* Mark all CS interrupts as complete */
>   	smp_store_mb(engine->execlists.active, 0);
> -	synchronize_hardirq(engine->i915->drm.irq);
>   
>   	clear_gtiir(engine);
>   
> 

Funny, I was just commenting on synchronize_hardirq in the direct 
submission series.

In this one I don't immediately see what deadlocks. What's on the other 
side, I mean what is preventing interrupt handler from exiting?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish
  2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
@ 2018-06-04 15:17   ` Tvrtko Ursulin
  2018-06-05  9:31     ` Chris Wilson
  2018-06-14 15:48   ` Mika Kuoppala
  1 sibling, 1 reply; 15+ messages in thread
From: Tvrtko Ursulin @ 2018-06-04 15:17 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: mika.kuoppala


On 04/06/2018 08:34, Chris Wilson wrote:
> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.

How does this happen? We have nothing in ports but a queue of requests, 
but we managed to declare a GPU hang, even though there is nothing in 
ports so GPU looks idle from the outside?

Regards,

Tvrtko

> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>   
>   static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>   	int ret;
>   
>   	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   
>   	enable_execlists(engine);
>   
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>   	return 0;
>   }
>   
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>   
>   static void execlists_reset_finish(struct intel_engine_cs *engine)
>   {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>   	/*
>   	 * Flush the tasklet while we still have the forcewake to be sure
>   	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>   	 * serialising multiple attempts to reset so that we know that we
>   	 * are the only one manipulating tasklet state.
>   	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>   
>   	GEM_TRACE("%s\n", engine->name);
>   }
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq()
  2018-06-04 15:13 ` [PATCH 1/3] " Tvrtko Ursulin
@ 2018-06-04 15:21   ` Chris Wilson
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-04 15:21 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: mika.kuoppala

Quoting Tvrtko Ursulin (2018-06-04 16:13:54)
> 
> On 04/06/2018 08:34, Chris Wilson wrote:
> > This looked to be a sensible precaution to flush an ongoing interrupt
> > before resetting the IRQ. However, it has a potential to trigger a lockup
> > and is not strictly required so drop it.
> > 
> > [21411.603173] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> > [21411.603174] Modules linked in: i915 intel_gtt drm_kms_helper
> > [21411.603181] CPU: 1 PID: 672 Comm: kworker/1:2 Tainted: G     U            4.17.0-rc7+ #307
> > [21411.603181] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0029.2016.1124.1625 11/24/2016
> > [21411.603227] Workqueue: events_long i915_hangcheck_elapsed [i915]
> > [21411.603234] RIP: 0010:__synchronize_hardirq+0x12/0x50
> > [21411.603235] RSP: 0018:ffffc90000393c48 EFLAGS: 00000006
> > [21411.603236] RAX: 0000000001440200 RBX: ffff880275b09200 RCX: 000000000000003d
> > [21411.603237] RDX: 0000000000000000 RSI: 000000000000007d RDI: ffff880275b09200
> > [21411.603238] RBP: ffff880256e81440 R08: ffff88027647a910 R09: ffff880275b09200
> > [21411.603239] R10: 0000000000000000 R11: 0000000000000040 R12: ffff880275b092a4
> > [21411.603239] R13: 0000000000000286 R14: ffff880273a86000 R15: ffff88027004eac0
> > [21411.603241] FS:  0000000000000000(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
> > [21411.603242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [21411.603243] CR2: 00007f9cc24d6018 CR3: 0000000002a09000 CR4: 00000000001406e0
> > [21411.603243] Call Trace:
> > [21411.603250]  synchronize_hardirq+0x1b/0x30
> > [21411.603286]  reset_irq+0x1f/0x180 [i915]
> > [21411.603323]  execlists_reset+0x2f/0x1d0 [i915]
> > [21411.603351]  i915_reset_engine+0xf0/0x140 [i915]
> > [21411.603381]  i915_handle_error+0x11d/0x420 [i915]
> > [21411.603386]  ? scnprintf+0x3a/0x70
> > [21411.603421]  i915_hangcheck_elapsed+0x310/0x490 [i915]
> > [21411.603425]  process_one_work+0x195/0x310
> > [21411.603427]  worker_thread+0x26/0x3c0
> > [21411.603429]  ? wq_nice_store+0xd0/0xd0
> > [21411.603431]  kthread+0x107/0x120
> > [21411.603433]  ? _kthread_create_worker_on_cpu+0x40/0x40
> > [21411.603435]  ret_from_fork+0x1f/0x30
> > [21411.603437] Code: 00 00 89 c5 e8 50 e1 41 00 48 8b 53 38 89 e8 81 22 ff ff fb ff 5b 5d c3 90 41 54 4c 8d a7 a4 00 00 00 55 53 48 89 fb eb 02 f3 90 <48> 8b 43 38 8b 00 a9 00 00 04 00 75 f1 4c 89 e7 e8 89 e0 41 00
> > 
> > Fixes: 46b3617dfec8 ("drm/i915: Actually flush interrupts on reset not just wedging")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_lrc.c | 1 -
> >   1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 517e92c6a70b..8d912d0c8fc1 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -869,7 +869,6 @@ static void reset_irq(struct intel_engine_cs *engine)
> >   {
> >       /* Mark all CS interrupts as complete */
> >       smp_store_mb(engine->execlists.active, 0);
> > -     synchronize_hardirq(engine->i915->drm.irq);
> >   
> >       clear_gtiir(engine);
> >   
> > 
> 
> Funny, I was just commenting on synchronize_hardirq in the direct 
> submission series.
> 
> In this one I don't immediately see what deadlocks. What's on the other 
> side, I mean what is preventing interrupt handler from exiting?

Hmm, right I probably just messed up over there. Knee jerk reaction in
thinking it was a local CPU deadlock.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-04  7:34 ` [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
@ 2018-06-05  9:26   ` Mika Kuoppala
  2018-06-05  9:33     ` Chris Wilson
  0 siblings, 1 reply; 15+ messages in thread
From: Mika Kuoppala @ 2018-06-05  9:26 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Hangcheck is our back up in case the GPU or the driver gets stuck. It
> detects when the GPU is not making any progress and issues a GPU reset.
> However, if the driver is failing to make any progress, we can get
> ourselves into a situation where we continually try resetting the GPU to
> no avail. Employ a second timeout such that if we continue to see the
> same seqno (the stalled engine has made no progress at all) over the
> course of several hangchecks, declare the driver wedged and attempt to
> start afresh.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
>  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
>  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
>  4 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 15e86d34a81c..2eae3abb0ec1 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1359,11 +1359,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
>  			   engine->hangcheck.seqno, seqno[id],
>  			   intel_engine_last_submit(engine));
> -		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n",
> +		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
>  			   yesno(intel_engine_has_waiter(engine)),
>  			   yesno(test_bit(engine->id,
>  					  &dev_priv->gpu_error.missed_irq_rings)),
> -			   yesno(engine->hangcheck.stalled));
> +			   yesno(engine->hangcheck.stalled),
> +			   yesno(engine->hangcheck.wedged));
>  
>  		spin_lock_irq(&b->rb_lock);
>  		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 38157df6ff5c..a4ed0baeb0ed 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -994,6 +994,8 @@ struct i915_gem_mm {
>  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
>  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
>  
> +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> +
>  enum modeset_restore {
>  	MODESET_ON_LID_OPEN,
>  	MODESET_DONE,
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index d47e346bd49e..2fc7a0dd0df9 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
>  	engine->hangcheck.seqno = hc->seqno;
>  	engine->hangcheck.action = hc->action;
>  	engine->hangcheck.stalled = hc->stalled;
> +	engine->hangcheck.wedged = hc->wedged;
>  }
>  
>  static enum intel_engine_hangcheck_action
> @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>  
>  	hc->stalled = time_after(jiffies,
>  				 engine->hangcheck.action_timestamp + timeout);
> +	hc->wedged = time_after(jiffies,
> +				 engine->hangcheck.action_timestamp +
> +				 I915_ENGINE_WEDGED_TIMEOUT);

As the init_hangcheck() will clear the wedged state,
our reset failure paths needs to steer clear of this.

Would it be better if we warn on the wedged being set
in init paths and explicitly clear it on unset_wedged?

-Mika


>  }
>  
>  static void hangcheck_declare_hang(struct drm_i915_private *i915,
> @@ -409,7 +413,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  			     gpu_error.hangcheck_work.work);
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
> -	unsigned int hung = 0, stuck = 0;
> +	unsigned int hung = 0, stuck = 0, wedged = 0;
>  
>  	if (!i915_modparams.enable_hangcheck)
>  		return;
> @@ -440,6 +444,17 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  			if (hc.action != ENGINE_DEAD)
>  				stuck |= intel_engine_flag(engine);
>  		}
> +
> +		if (engine->hangcheck.wedged)
> +			wedged |= intel_engine_flag(engine);
> +	}
> +
> +	if (wedged) {
> +		dev_err(dev_priv->drm.dev,
> +			"GPU recovery timed out,"
> +			" cancelling all in-flight rendering.\n");
> +		GEM_TRACE_DUMP();
> +		i915_gem_set_wedged(dev_priv);
>  	}
>  
>  	if (hung)
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index acef385c4c80..d9e8d79c56e4 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -122,7 +122,8 @@ struct intel_engine_hangcheck {
>  	int deadlock;
>  	struct intel_instdone instdone;
>  	struct i915_request *active_request;
> -	bool stalled;
> +	bool stalled:1;
> +	bool wedged:1;
>  };
>  
>  struct intel_ring {
> -- 
> 2.17.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish
  2018-06-04 15:17   ` Tvrtko Ursulin
@ 2018-06-05  9:31     ` Chris Wilson
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-05  9:31 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: mika.kuoppala

Quoting Tvrtko Ursulin (2018-06-04 16:17:47)
> 
> On 04/06/2018 08:34, Chris Wilson wrote:
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
> 
> How does this happen? We have nothing in ports but a queue of requests, 
> but we managed to declare a GPU hang, even though there is nothing in 
> ports so GPU looks idle from the outside?

Driver bug. A buggy driver is no excuse for us to fail to recover
though.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-05  9:26   ` Mika Kuoppala
@ 2018-06-05  9:33     ` Chris Wilson
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-05  9:33 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-06-05 10:26:37)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Hangcheck is our back up in case the GPU or the driver gets stuck. It
> > detects when the GPU is not making any progress and issues a GPU reset.
> > However, if the driver is failing to make any progress, we can get
> > ourselves into a situation where we continually try resetting the GPU to
> > no avail. Employ a second timeout such that if we continue to see the
> > same seqno (the stalled engine has made no progress at all) over the
> > course of several hangchecks, declare the driver wedged and attempt to
> > start afresh.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
> >  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
> >  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
> >  4 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 15e86d34a81c..2eae3abb0ec1 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1359,11 +1359,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> >               seq_printf(m, "    seqno = %x [current %x, last %x]\n",
> >                          engine->hangcheck.seqno, seqno[id],
> >                          intel_engine_last_submit(engine));
> > -             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s\n",
> > +             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> >                          yesno(intel_engine_has_waiter(engine)),
> >                          yesno(test_bit(engine->id,
> >                                         &dev_priv->gpu_error.missed_irq_rings)),
> > -                        yesno(engine->hangcheck.stalled));
> > +                        yesno(engine->hangcheck.stalled),
> > +                        yesno(engine->hangcheck.wedged));
> >  
> >               spin_lock_irq(&b->rb_lock);
> >               for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 38157df6ff5c..a4ed0baeb0ed 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -994,6 +994,8 @@ struct i915_gem_mm {
> >  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
> >  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
> >  
> > +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> > +
> >  enum modeset_restore {
> >       MODESET_ON_LID_OPEN,
> >       MODESET_DONE,
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index d47e346bd49e..2fc7a0dd0df9 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
> >       engine->hangcheck.seqno = hc->seqno;
> >       engine->hangcheck.action = hc->action;
> >       engine->hangcheck.stalled = hc->stalled;
> > +     engine->hangcheck.wedged = hc->wedged;
> >  }
> >  
> >  static enum intel_engine_hangcheck_action
> > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
> >  
> >       hc->stalled = time_after(jiffies,
> >                                engine->hangcheck.action_timestamp + timeout);
> > +     hc->wedged = time_after(jiffies,
> > +                              engine->hangcheck.action_timestamp +
> > +                              I915_ENGINE_WEDGED_TIMEOUT);
> 
> As the init_hangcheck() will clear the wedged state,
> our reset failure paths needs to steer clear of this.

They do by definition (since the reset failed :)
 
> Would it be better if we warn on the wedged being set
> in init paths and explicitly clear it on unset_wedged?

I don't think so, if we reset, we may as well regard the reset as
completed. We detect repeated attempts to do the same reset; it's just
we need a back in case we don't do any reset.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish
  2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
  2018-06-04 15:17   ` Tvrtko Ursulin
@ 2018-06-14 15:48   ` Mika Kuoppala
  2018-06-14 18:38     ` Chris Wilson
  1 sibling, 1 reply; 15+ messages in thread
From: Mika Kuoppala @ 2018-06-14 15:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.
> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>  
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>  	int ret;
>  
>  	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  
>  	enable_execlists(engine);
>  
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>  	return 0;
>  }
>  
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  
>  static void execlists_reset_finish(struct intel_engine_cs *engine)
>  {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>  	/*
>  	 * Flush the tasklet while we still have the forcewake to be sure
>  	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>  	 * serialising multiple attempts to reset so that we know that we
>  	 * are the only one manipulating tasklet state.
>  	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>  
>  	GEM_TRACE("%s\n", engine->name);
>  }
> -- 
> 2.17.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish
  2018-06-14 15:48   ` Mika Kuoppala
@ 2018-06-14 18:38     ` Chris Wilson
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Wilson @ 2018-06-14 18:38 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-06-14 16:48:48)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
> > Here, we skip the per-engine reset as there is no guilty request, but in
> > doing so we also skip the engine restart leaving ourselves with a
> > permanently hung engine. A quick way to recover is by moving the tasklet
> > kick to execlists_reset_finish() (from init_hw). We still emit the error
> > on hanging, so the error is not lost but we should be able to recover.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Thanks for the review, pushed.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-06-14 18:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04  7:34 [PATCH 1/3] drm/i915/execlists: Remove synchronize_hardirq() Chris Wilson
2018-06-04  7:34 ` [PATCH 2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish Chris Wilson
2018-06-04 15:17   ` Tvrtko Ursulin
2018-06-05  9:31     ` Chris Wilson
2018-06-14 15:48   ` Mika Kuoppala
2018-06-14 18:38     ` Chris Wilson
2018-06-04  7:34 ` [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
2018-06-05  9:26   ` Mika Kuoppala
2018-06-05  9:33     ` Chris Wilson
2018-06-04  7:57 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/3] drm/i915/execlists: Remove synchronize_hardirq() Patchwork
2018-06-04  7:58 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-06-04  8:18 ` ✓ Fi.CI.BAT: success " Patchwork
2018-06-04  9:09 ` ✓ Fi.CI.IGT: " Patchwork
2018-06-04 15:13 ` [PATCH 1/3] " Tvrtko Ursulin
2018-06-04 15:21   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.