[PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init

* [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init
@ 2013-06-10 10:20 Chris Wilson
  2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC
  To: intel-gfx

When we reset and restart a ring, we also want to clear any existing
hangcheck.
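
A minimal sketch of the reset described above, assuming simplified
stand-in types (struct hangcheck_state and struct ring_sketch are
hypothetical, not the driver's own structures). It only illustrates that
zeroing the whole struct discards the stale seqno, ACTHD and score in one
step:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct intel_ring_hangcheck. */
struct hangcheck_state {
	uint32_t seqno;
	uint32_t acthd;
	int score;
};

/* Hypothetical stand-in for the ring; only the field of interest. */
struct ring_sketch {
	struct hangcheck_state hangcheck;
};

/* Forget everything hangcheck recorded before the (re)initialisation. */
static void ring_sketch_init(struct ring_sketch *ring)
{
	assert(ring != NULL);
	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
}
```

Zeroing the struct as a whole, rather than field by field, means any
member added to it later is reset as well.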

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 1ef081c..a3cfa35 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -453,6 +453,8 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 		ring->last_retired_head = -1;
 	}
 
+	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+
 out:
 	if (HAS_FORCE_WAKE(dev))
 		gen6_gt_force_wake_put(dev_priv);
-- 
1.7.10.4

* [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
  2013-06-11  9:45   ` Daniel Vetter
  2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
  2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
  2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC
  To: intel-gfx; +Cc: Ben Widawsky

After kicking a ring, it should be free to make progress again and so
should not be accused of being stuck until hangcheck fires once more. In
order to catch a denial-of-service within a batch or across multiple
batches, we still do increment the hangcheck score - just not as
severely so that it takes multiple kicks to fail.
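
To make the weighting concrete, here is a small standalone model of the
scoring. The BUSY/KICK/HUNG/FIRE values are the ones introduced by this
patch; enum observation, score_update() and checks_until_fire() are
hypothetical names used only for illustration, and the model ignores
everything except the score itself:

```c
#include <assert.h>

/* Weights as defined in this patch. */
enum { BUSY = 1, KICK = 5, HUNG = 20, FIRE = 30 };

/* Hypothetical summary of what one hangcheck pass saw on a ring. */
enum observation {
	OBS_NEW_REQUEST,	/* seqno advanced since the last check */
	OBS_BUSY,		/* same request, ACTHD still moving */
	OBS_KICKED,		/* same request, stuck, kick succeeded */
	OBS_HUNG		/* same request, stuck, nothing to kick */
};

/* Apply one observation to the running score. */
static int score_update(int score, enum observation obs)
{
	assert(score >= 0);

	switch (obs) {
	case OBS_NEW_REQUEST:
		/* Decay slowly instead of resetting to zero. */
		return score > 0 ? score - 1 : 0;
	case OBS_BUSY:
		return score + BUSY;
	case OBS_KICKED:
		return score + KICK;
	case OBS_HUNG:
		return score + HUNG;
	}
	return score;
}

/* Consecutive identical observations needed before score > FIRE,
 * or -1 if the observation can never trigger a hang report.
 */
static int checks_until_fire(enum observation obs)
{
	int score = 0, checks = 0;

	if (obs == OBS_NEW_REQUEST)
		return -1;

	while (score <= FIRE) {
		score = score_update(score, obs);
		checks++;
	}
	return checks;
}
```

Under this model a ring that is kicked on every check is declared hung
on the 7th check, a ring that cannot be kicked on the 2nd, and a ring
that stays busy on the same request on the 31st.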

This should address part of Ben's justified criticism of

commit 05407ff889ceebe383aa5907219f86582ef96b72
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu May 30 09:04:29 2013 +0300

    drm/i915: detect hang using per ring hangcheck_score

"There's also another corner case on the kick. If the seqno = 2
(though not stuck), and on the 3rd hangcheck, the ring is stuck, and
we try to kick it... we don't actually try to find out if the kick
helped."

v2: Make sure we catch DoS attempts with batches full of invalid WAITs.
v3: Preserve the ability to detect loops by always charging the ring
    if it is busy on the same request.
v4: Make sure we queue another check if on a new batch

References: https://bugs.freedesktop.org/show_bug.cgi?id=65394
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c |  110 +++++++++++++++++++++------------------
 1 file changed, 58 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index dcb5209..32b2465 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2324,21 +2324,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
 			  struct drm_i915_gem_request, list)->seqno;
 }
 
-static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
-				     u32 ring_seqno, bool *err)
-{
-	if (list_empty(&ring->request_list) ||
-	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
-		/* Issue a wake-up to catch stuck h/w. */
-		if (waitqueue_active(&ring->irq_queue)) {
-			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
-				  ring->name);
-			wake_up_all(&ring->irq_queue);
-			*err = true;
-		}
-		return true;
-	}
-	return false;
+static bool
+ring_idle(struct intel_ring_buffer *ring, u32 seqno)
+{
+	return (list_empty(&ring->request_list) ||
+		i915_seqno_passed(seqno, ring_last_seqno(ring)));
 }
 
 static bool semaphore_passed(struct intel_ring_buffer *ring)
@@ -2372,16 +2362,26 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
 				 ioread32(ring->virtual_start+acthd+4)+1);
 }
 
-static bool kick_ring(struct intel_ring_buffer *ring)
+static bool ring_hung(struct intel_ring_buffer *ring)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	u32 tmp = I915_READ_CTL(ring);
+	u32 tmp;
+
+	if (IS_GEN2(dev))
+		return true;
+
+	/* Is the chip hanging on a WAIT_FOR_EVENT?
+	 * If so we can simply poke the RB_WAIT bit
+	 * and break the hang. This should work on
+	 * all but the second generation chipsets.
+	 */
+	tmp = I915_READ_CTL(ring);
 	if (tmp & RING_WAIT) {
 		DRM_ERROR("Kicking stuck wait on %s\n",
 			  ring->name);
 		I915_WRITE_CTL(ring, tmp);
-		return true;
+		return false;
 	}
 
 	if (INTEL_INFO(dev)->gen >= 6 &&
@@ -2390,22 +2390,10 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 		DRM_ERROR("Kicking stuck semaphore on %s\n",
 			  ring->name);
 		I915_WRITE_CTL(ring, tmp);
-		return true;
-	}
-	return false;
-}
-
-static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
-{
-	if (IS_GEN2(ring->dev))
 		return false;
+	}
 
-	/* Is the chip hanging on a WAIT_FOR_EVENT?
-	 * If so we can simply poke the RB_WAIT bit
-	 * and break the hang. This should work on
-	 * all but the second generation chipsets.
-	 */
-	return !kick_ring(ring);
+	return true;
 }
 
 /**
@@ -2423,45 +2411,63 @@ void i915_hangcheck_elapsed(unsigned long data)
 	struct intel_ring_buffer *ring;
 	int i;
 	int busy_count = 0, rings_hung = 0;
-	bool stuck[I915_NUM_RINGS];
+	bool stuck[I915_NUM_RINGS] = { 0 };
+#define BUSY 1
+#define KICK 5
+#define HUNG 20
+#define FIRE 30
 
 	if (!i915_enable_hangcheck)
 		return;
 
 	for_each_ring(ring, dev_priv, i) {
 		u32 seqno, acthd;
-		bool idle, err = false;
+		bool busy = true;
 
 		seqno = ring->get_seqno(ring, false);
 		acthd = intel_ring_get_active_head(ring);
-		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
-		stuck[i] = ring->hangcheck.acthd == acthd;
-
-		if (idle) {
-			if (err)
-				ring->hangcheck.score += 2;
-			else
-				ring->hangcheck.score = 0;
-		} else {
-			busy_count++;
 
-			if (ring->hangcheck.seqno == seqno) {
-				ring->hangcheck.score++;
-
-				/* Kick ring if stuck*/
-				if (stuck[i])
-					i915_hangcheck_ring_hung(ring);
+		if (ring->hangcheck.seqno == seqno) {
+			if (ring_idle(ring, seqno)) {
+				if (waitqueue_active(&ring->irq_queue)) {
+					/* Issue a wake-up to catch stuck h/w. */
+					DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
+						  ring->name);
+					wake_up_all(&ring->irq_queue);
+					ring->hangcheck.score += HUNG;
+				} else
+					busy = false;
 			} else {
-				ring->hangcheck.score = 0;
+				int score;
+
+				stuck[i] = ring->hangcheck.acthd == acthd;
+				if (stuck[i]) {
+					/* Every time we kick the ring, add a
+					 * small increment to the hangcheck
+					 * score so that we can catch a
+					 * batch that is repeatedly kicked.
+					 */
+					score = ring_hung(ring) ? HUNG : KICK;
+				} else
+					score = BUSY;
+
+				ring->hangcheck.score += score;
 			}
+		} else {
+			/* Gradually reduce the count so that we catch DoS
+			 * attempts across multiple batches.
+			 */
+			if (ring->hangcheck.score > 0)
+				ring->hangcheck.score--;
 		}
 
 		ring->hangcheck.seqno = seqno;
 		ring->hangcheck.acthd = acthd;
+		busy_count += busy;
 	}
 
 	for_each_ring(ring, dev_priv, i) {
-		if (ring->hangcheck.score > 2) {
+		if (ring->hangcheck.score > FIRE) {
 			rings_hung++;
 			DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
 				  stuck[i] ? "stuck" : "no progress",
-- 
1.7.10.4

* [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring
  2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
  2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
  2013-06-11  9:51   ` Daniel Vetter
  2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
  2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC
  To: intel-gfx; +Cc: Ben Widawsky

If we detect a ring is in a valid wait for another, just let it be.
Eventually it will either begin to progress again, or the entire system
will come grinding to a halt and then hangcheck will fire as soon as the
deadlock is detected.
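
The deadlock check can be pictured as a walk along the chain of
semaphore waits with a per-ring "visited" flag. The sketch below is a
simplified model, not the driver code: struct ring_model, wait_state()
and clear_visited() are hypothetical, each ring waits on at most one
other ring, and the flag plays the role of hangcheck.deadlock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_RINGS 3

/* Hypothetical model of a ring, reduced to its semaphore state. */
struct ring_model {
	int waits_on;		/* index of the signalling ring, -1 if none */
	uint32_t wait_seqno;	/* seqno the waiter needs from the signaller */
	uint32_t seqno;		/* last seqno this ring completed */
	bool visited;		/* stands in for hangcheck.deadlock */
};

/* Wrap-safe "a is at or after b" comparison. */
static bool seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Reset the flags before each new walk. */
static void clear_visited(struct ring_model *rings)
{
	for (int i = 0; i < NUM_RINGS; i++)
		rings[i].visited = false;
}

/* Returns -1 for a deadlock or unknown signaller, 0 while the wait is
 * still legitimate, 1 once the signaller has passed the awaited seqno.
 */
static int wait_state(struct ring_model *rings, int id)
{
	assert(id >= 0 && id < NUM_RINGS);

	struct ring_model *ring = &rings[id];
	struct ring_model *signaller;

	ring->visited = true;

	if (ring->waits_on < 0 || ring->waits_on >= NUM_RINGS)
		return -1;

	signaller = &rings[ring->waits_on];
	if (signaller->visited)
		return -1;	/* the chain loops back on itself */

	/* If the signaller is itself waiting, follow the chain. */
	if (signaller->waits_on >= 0 &&
	    wait_state(rings, ring->waits_on) < 0)
		return -1;

	return seqno_passed(signaller->seqno, ring->wait_seqno);
}
```

In this model only the -1 result is treated as hung; a result of 0
leaves the waiting ring uncharged, which is the "just let it be" case
described above.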

This error was foretold by Ben in
commit 05407ff889ceebe383aa5907219f86582ef96b72
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu May 30 09:04:29 2013 +0300

    drm/i915: detect hang using per ring hangcheck_score

"If ring B is waiting on ring A via semaphore, and ring A is making
progress, albeit slowly - the hangcheck will fire. The check will
determine that A is moving, however ring B will appear hung because
the ACTHD doesn't move. I honestly can't say if that's actually a
realistic problem to hit it probably implies the timeout value is too
low."

v2: Make sure we don't even incur the KICK cost whilst waiting.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=65394
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c         |  121 +++++++++++++++++++++++--------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 90 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32b2465..cf8584c 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2331,21 +2331,21 @@ ring_idle(struct intel_ring_buffer *ring, u32 seqno)
 		i915_seqno_passed(seqno, ring_last_seqno(ring)));
 }
 
-static bool semaphore_passed(struct intel_ring_buffer *ring)
+static struct intel_ring_buffer *
+semaphore_waits_for(struct intel_ring_buffer *ring, u32 *seqno)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	u32 acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
-	struct intel_ring_buffer *signaller;
-	u32 cmd, ipehr, acthd_min;
+	u32 cmd, ipehr, acthd, acthd_min;
 
 	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
 	if ((ipehr & ~(0x3 << 16)) !=
 	    (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER))
-		return false;
+		return NULL;
 
 	/* ACTHD is likely pointing to the dword after the actual command,
 	 * so scan backwards until we find the MBOX.
 	 */
+	acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
 	acthd_min = max((int)acthd - 3 * 4, 0);
 	do {
 		cmd = ioread32(ring->virtual_start + acthd);
@@ -2354,22 +2354,53 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
 
 		acthd -= 4;
 		if (acthd < acthd_min)
-			return false;
+			return NULL;
 	} while (1);
 
-	signaller = &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
-	return i915_seqno_passed(signaller->get_seqno(signaller, false),
-				 ioread32(ring->virtual_start+acthd+4)+1);
+	*seqno = ioread32(ring->virtual_start+acthd+4)+1;
+	return &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
+}
+
+static int semaphore_passed(struct intel_ring_buffer *ring)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct intel_ring_buffer *signaller;
+	u32 seqno, ctl;
+
+	ring->hangcheck.deadlock = true;
+
+	signaller = semaphore_waits_for(ring, &seqno);
+	if (signaller == NULL || signaller->hangcheck.deadlock)
+		return -1;
+
+	/* cursory check for an unkickable deadlock */
+	ctl = I915_READ_CTL(signaller);
+	if (ctl & RING_WAIT_SEMAPHORE && semaphore_passed(signaller) < 0)
+		return -1;
+
+	return i915_seqno_passed(signaller->get_seqno(signaller, false), seqno);
+}
+
+static void semaphore_clear_deadlocks(struct drm_i915_private *dev_priv)
+{
+	struct intel_ring_buffer *ring;
+	int i;
+
+	for_each_ring(ring, dev_priv, i)
+		ring->hangcheck.deadlock = false;
 }
 
-static bool ring_hung(struct intel_ring_buffer *ring)
+static enum { wait, active, kick, hung } ring_stuck(struct intel_ring_buffer *ring, u32 acthd)
 {
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	u32 tmp;
 
+	if (ring->hangcheck.acthd != acthd)
+		return active;
+
 	if (IS_GEN2(dev))
-		return true;
+		return hung;
 
 	/* Is the chip hanging on a WAIT_FOR_EVENT?
 	 * If so we can simply poke the RB_WAIT bit
@@ -2381,19 +2412,24 @@ static bool ring_hung(struct intel_ring_buffer *ring)
 		DRM_ERROR("Kicking stuck wait on %s\n",
 			  ring->name);
 		I915_WRITE_CTL(ring, tmp);
-		return false;
-	}
-
-	if (INTEL_INFO(dev)->gen >= 6 &&
-	    tmp & RING_WAIT_SEMAPHORE &&
-	    semaphore_passed(ring)) {
-		DRM_ERROR("Kicking stuck semaphore on %s\n",
-			  ring->name);
-		I915_WRITE_CTL(ring, tmp);
-		return false;
+		return kick;
+	}
+
+	if (INTEL_INFO(dev)->gen >= 6 && tmp & RING_WAIT_SEMAPHORE) {
+		switch (semaphore_passed(ring)) {
+		default:
+			return hung;
+		case 1:
+			DRM_ERROR("Kicking stuck semaphore on %s\n",
+				  ring->name);
+			I915_WRITE_CTL(ring, tmp);
+			return kick;
+		case 0:
+			return wait;
+		}
 	}
 
-	return true;
+	return hung;
 }
 
 /**
@@ -2424,6 +2460,8 @@ void i915_hangcheck_elapsed(unsigned long data)
 		u32 seqno, acthd;
 		bool busy = true;
 
+		semaphore_clear_deadlocks(dev_priv);
+
 		seqno = ring->get_seqno(ring, false);
 		acthd = intel_ring_get_active_head(ring);
 
@@ -2440,17 +2478,36 @@ void i915_hangcheck_elapsed(unsigned long data)
 			} else {
 				int score;
 
-				stuck[i] = ring->hangcheck.acthd == acthd;
-				if (stuck[i]) {
-					/* Every time we kick the ring, add a
-					 * small increment to the hangcheck
-					 * score so that we can catch a
-					 * batch that is repeatedly kicked.
-					 */
-					score = ring_hung(ring) ? HUNG : KICK;
-				} else
+				/* We always increment the hangcheck score
+				 * if the ring is busy and still processing
+				 * the same request, so that no single request
+				 * can run indefinitely (such as a chain of
+				 * batches). The only time we do not increment
+				 * the hangcheck score on this ring, if this
+				 * ring is in a legitimate wait for another
+				 * ring. In that case the waiting ring is a
+				 * victim and we want to be sure we catch the
+				 * right culprit. Then every time we do kick
+				 * the ring, add a small increment to the
+				 * score so that we can catch a batch that is
+				 * being repeatedly kicked and so responsible
+				 * for stalling the machine.
+				 */
+				switch (ring_stuck(ring, acthd)) {
+				case wait:
+					score = 0;
+					break;
+				case active:
 					score = BUSY;
-
+					break;
+				case kick:
+					score = KICK;
+					break;
+				case hung:
+					score = HUNG;
+					stuck[i] = true;
+					break;
+				}
 				ring->hangcheck.score += score;
 			}
 		} else {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index efc403d..a3e9610 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -38,6 +38,7 @@ struct  intel_hw_status_page {
 #define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
 
 struct intel_ring_hangcheck {
+	bool deadlock;
 	u32 seqno;
 	u32 acthd;
 	int score;
-- 
1.7.10.4

* [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning
  2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
  2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
  2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
  2013-06-10 13:42   ` Mika Kuoppala
  2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC
  To: intel-gfx

This is of no value to the developer reading the report, let alone the
bamboozled user.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index cf8584c..39730b6 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2525,12 +2525,10 @@ void i915_hangcheck_elapsed(unsigned long data)
 
 	for_each_ring(ring, dev_priv, i) {
 		if (ring->hangcheck.score > FIRE) {
-			rings_hung++;
-			DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
+			DRM_ERROR("%s on %s ring\n",
 				  stuck[i] ? "stuck" : "no progress",
-				  stuck[i] ? "addr" : "seqno",
-				  stuck[i] ? ring->hangcheck.acthd & HEAD_ADDR :
-				  ring->hangcheck.seqno);
+				  ring->name);
+			rings_hung++;
 		}
 	}
 
-- 
1.7.10.4

* Re: [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning
  2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
@ 2013-06-10 13:42   ` Mika Kuoppala
  0 siblings, 0 replies; 12+ messages in thread
From: Mika Kuoppala @ 2013-06-10 13:42 UTC
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> This is of no value to the developer reading the report, let alone the
> bamboozled user.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |    8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index cf8584c..39730b6 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2525,12 +2525,10 @@ void i915_hangcheck_elapsed(unsigned long data)
>  
>  	for_each_ring(ring, dev_priv, i) {
>  		if (ring->hangcheck.score > FIRE) {
> -			rings_hung++;
> -			DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
> +			DRM_ERROR("%s on %s ring\n",
                                            ^^
Noticed one redundant 'ring' in here, as ring->name already contains it.

Patches 1-4:
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
@ 2013-06-11  9:45   ` Daniel Vetter
  2013-06-11 13:40     ` Chris Wilson
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11  9:45 UTC
  To: Chris Wilson; +Cc: intel-gfx, Ben Widawsky

On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> After kicking a ring, it should be free to make progress again and so
> should not be accused of being stuck until hangcheck fires once more. In
> order to catch a denial-of-service within a batch or across multiple
> batches, we still do increment the hangcheck score - just not as
> severely so that it takes multiple kicks to fail.
> 
> This should address part of Ben's justified criticism of
> 
> commit 05407ff889ceebe383aa5907219f86582ef96b72
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Thu May 30 09:04:29 2013 +0300
> 
>     drm/i915: detect hang using per ring hangcheck_score
> 
> "There's also another corner case on the kick. If the seqno = 2
> (though not stuck), and on the 3rd hangcheck, the ring is stuck, and
> we try to kick it... we don't actually try to find out if the kick
> helped."
> 
> v2: Make sure we catch DoS attempts with batches full of invalid WAITs.
> v3: Preserve the ability to detect loops by always charging the ring
>     if it is busy on the same request.
> v4: Make sure we queue another check if on a new batch
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=65394
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |  110 +++++++++++++++++++++------------------
>  1 file changed, 58 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index dcb5209..32b2465 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2324,21 +2324,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
>  			  struct drm_i915_gem_request, list)->seqno;
>  }
>  
> -static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
> -				     u32 ring_seqno, bool *err)
> -{
> -	if (list_empty(&ring->request_list) ||
> -	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
> -		/* Issue a wake-up to catch stuck h/w. */
> -		if (waitqueue_active(&ring->irq_queue)) {
> -			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> -				  ring->name);
> -			wake_up_all(&ring->irq_queue);
> -			*err = true;
> -		}
> -		return true;
> -	}
> -	return false;
> +static bool
> +ring_idle(struct intel_ring_buffer *ring, u32 seqno)
> +{
> +	return (list_empty(&ring->request_list) ||
> +		i915_seqno_passed(seqno, ring_last_seqno(ring)));
>  }
>  
>  static bool semaphore_passed(struct intel_ring_buffer *ring)
> @@ -2372,16 +2362,26 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
>  				 ioread32(ring->virtual_start+acthd+4)+1);
>  }
>  
> -static bool kick_ring(struct intel_ring_buffer *ring)
> +static bool ring_hung(struct intel_ring_buffer *ring)
>  {
>  	struct drm_device *dev = ring->dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> -	u32 tmp = I915_READ_CTL(ring);
> +	u32 tmp;
> +
> +	if (IS_GEN2(dev))
> +		return true;
> +
> +	/* Is the chip hanging on a WAIT_FOR_EVENT?
> +	 * If so we can simply poke the RB_WAIT bit
> +	 * and break the hang. This should work on
> +	 * all but the second generation chipsets.
> +	 */
> +	tmp = I915_READ_CTL(ring);
>  	if (tmp & RING_WAIT) {
>  		DRM_ERROR("Kicking stuck wait on %s\n",
>  			  ring->name);
>  		I915_WRITE_CTL(ring, tmp);
> -		return true;
> +		return false;
>  	}
>  
>  	if (INTEL_INFO(dev)->gen >= 6 &&
> @@ -2390,22 +2390,10 @@ static bool kick_ring(struct intel_ring_buffer *ring)
>  		DRM_ERROR("Kicking stuck semaphore on %s\n",
>  			  ring->name);
>  		I915_WRITE_CTL(ring, tmp);
> -		return true;
> -	}
> -	return false;
> -}
> -
> -static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
> -{
> -	if (IS_GEN2(ring->dev))
>  		return false;
> +	}
>  
> -	/* Is the chip hanging on a WAIT_FOR_EVENT?
> -	 * If so we can simply poke the RB_WAIT bit
> -	 * and break the hang. This should work on
> -	 * all but the second generation chipsets.
> -	 */
> -	return !kick_ring(ring);
> +	return true;
>  }
>  
>  /**
> @@ -2423,45 +2411,63 @@ void i915_hangcheck_elapsed(unsigned long data)
>  	struct intel_ring_buffer *ring;
>  	int i;
>  	int busy_count = 0, rings_hung = 0;
> -	bool stuck[I915_NUM_RINGS];
> +	bool stuck[I915_NUM_RINGS] = { 0 };
> +#define BUSY 1
> +#define KICK 5
> +#define HUNG 20
> +#define FIRE 30
>  
>  	if (!i915_enable_hangcheck)
>  		return;
>  
>  	for_each_ring(ring, dev_priv, i) {
>  		u32 seqno, acthd;
> -		bool idle, err = false;
> +		bool busy = true;
>  
>  		seqno = ring->get_seqno(ring, false);
>  		acthd = intel_ring_get_active_head(ring);
> -		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
> -		stuck[i] = ring->hangcheck.acthd == acthd;
> -
> -		if (idle) {
> -			if (err)
> -				ring->hangcheck.score += 2;
> -			else
> -				ring->hangcheck.score = 0;
> -		} else {
> -			busy_count++;
>  
> -			if (ring->hangcheck.seqno == seqno) {
> -				ring->hangcheck.score++;
> -
> -				/* Kick ring if stuck*/
> -				if (stuck[i])
> -					i915_hangcheck_ring_hung(ring);
> +		if (ring->hangcheck.seqno == seqno) {
> +			if (ring_idle(ring, seqno)) {
> +				if (waitqueue_active(&ring->irq_queue)) {
> +					/* Issue a wake-up to catch stuck h/w. */
> +					DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> +						  ring->name);
> +					wake_up_all(&ring->irq_queue);
> +					ring->hangcheck.score += HUNG;

Not sure whether we want to hit missed interrupts this badly; they were
rather common a while back ;-) But we can fine-tune this easily now, so
no reservations for merging from my side.
-Daniel

> +				} else
> +					busy = false;
>  			} else {
> -				ring->hangcheck.score = 0;
> +				int score;
> +
> +				stuck[i] = ring->hangcheck.acthd == acthd;
> +				if (stuck[i]) {
> +					/* Every time we kick the ring, add a
> +					 * small increment to the hangcheck
> +					 * score so that we can catch a
> +					 * batch that is repeatedly kicked.
> +					 */
> +					score = ring_hung(ring) ? HUNG : KICK;
> +				} else
> +					score = BUSY;
> +
> +				ring->hangcheck.score += score;
>  			}
> +		} else {
> +			/* Gradually reduce the count so that we catch DoS
> +			 * attempts across multiple batches.
> +			 */
> +			if (ring->hangcheck.score > 0)
> +				ring->hangcheck.score--;
>  		}
>  
>  		ring->hangcheck.seqno = seqno;
>  		ring->hangcheck.acthd = acthd;
> +		busy_count += busy;
>  	}
>  
>  	for_each_ring(ring, dev_priv, i) {
> -		if (ring->hangcheck.score > 2) {
> +		if (ring->hangcheck.score > FIRE) {
>  			rings_hung++;
>  			DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
>  				  stuck[i] ? "stuck" : "no progress",
> -- 
> 1.7.10.4

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring
  2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
@ 2013-06-11  9:51   ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11  9:51 UTC
  To: Chris Wilson; +Cc: intel-gfx, Ben Widawsky

On Mon, Jun 10, 2013 at 11:20:21AM +0100, Chris Wilson wrote:
> If we detect a ring is in a valid wait for another, just let it be.
> Eventually it will either begin to progress again, or the entire system
> will come grinding to a halt and then hangcheck will fire as soon as the
> deadlock is detected.
> 
> This error was foretold by Ben in
> commit 05407ff889ceebe383aa5907219f86582ef96b72
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Thu May 30 09:04:29 2013 +0300
> 
>     drm/i915: detect hang using per ring hangcheck_score
> 
> "If ring B is waiting on ring A via semaphore, and ring A is making
> progress, albeit slowly - the hangcheck will fire. The check will
> determine that A is moving, however ring B will appear hung because
> the ACTHD doesn't move. I honestly can't say if that's actually a
> realistic problem to hit it probably implies the timeout value is too
> low."
> 
> v2: Make sure we don't even incur the KICK cost whilst waiting.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=65394
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> ---
>  drivers/gpu/drm/i915/i915_irq.c         |  121 +++++++++++++++++++++++--------
>  drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
>  2 files changed, 90 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 32b2465..cf8584c 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2331,21 +2331,21 @@ ring_idle(struct intel_ring_buffer *ring, u32 seqno)
>  		i915_seqno_passed(seqno, ring_last_seqno(ring)));
>  }
>  
> -static bool semaphore_passed(struct intel_ring_buffer *ring)
> +static struct intel_ring_buffer *
> +semaphore_waits_for(struct intel_ring_buffer *ring, u32 *seqno)
>  {
>  	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> -	u32 acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
> -	struct intel_ring_buffer *signaller;
> -	u32 cmd, ipehr, acthd_min;
> +	u32 cmd, ipehr, acthd, acthd_min;
>  
>  	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
>  	if ((ipehr & ~(0x3 << 16)) !=
>  	    (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER))
> -		return false;
> +		return NULL;
>  
>  	/* ACTHD is likely pointing to the dword after the actual command,
>  	 * so scan backwards until we find the MBOX.
>  	 */
> +	acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
>  	acthd_min = max((int)acthd - 3 * 4, 0);
>  	do {
>  		cmd = ioread32(ring->virtual_start + acthd);
> @@ -2354,22 +2354,53 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
>  
>  		acthd -= 4;
>  		if (acthd < acthd_min)
> -			return false;
> +			return NULL;
>  	} while (1);
>  
> -	signaller = &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
> -	return i915_seqno_passed(signaller->get_seqno(signaller, false),
> -				 ioread32(ring->virtual_start+acthd+4)+1);
> +	*seqno = ioread32(ring->virtual_start+acthd+4)+1;
> +	return &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
> +}
> +
> +static int semaphore_passed(struct intel_ring_buffer *ring)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct intel_ring_buffer *signaller;
> +	u32 seqno, ctl;
> +
> +	ring->hangcheck.deadlock = true;
> +
> +	signaller = semaphore_waits_for(ring, &seqno);
> +	if (signaller == NULL || signaller->hangcheck.deadlock)
> +		return -1;
> +
> +	/* cursory check for an unkickable deadlock */
> +	ctl = I915_READ_CTL(signaller);
> +	if (ctl & RING_WAIT_SEMAPHORE && semaphore_passed(signaller) < 0)
> +		return -1;
> +
> +	return i915_seqno_passed(signaller->get_seqno(signaller, false), seqno);
> +}
> +
> +static void semaphore_clear_deadlocks(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_ring_buffer *ring;
> +	int i;
> +
> +	for_each_ring(ring, dev_priv, i)
> +		ring->hangcheck.deadlock = false;
>  }
>  
> -static bool ring_hung(struct intel_ring_buffer *ring)
> +static enum { wait, active, kick, hung } ring_stuck(struct intel_ring_buffer *ring, u32 acthd)
>  {
>  	struct drm_device *dev = ring->dev;
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	u32 tmp;
>  
> +	if (ring->hangcheck.acthd != acthd)
> +		return active;
> +
>  	if (IS_GEN2(dev))
> -		return true;
> +		return hung;
>  
>  	/* Is the chip hanging on a WAIT_FOR_EVENT?
>  	 * If so we can simply poke the RB_WAIT bit
> @@ -2381,19 +2412,24 @@ static bool ring_hung(struct intel_ring_buffer *ring)
>  		DRM_ERROR("Kicking stuck wait on %s\n",
>  			  ring->name);
>  		I915_WRITE_CTL(ring, tmp);
> -		return false;
> -	}
> -
> -	if (INTEL_INFO(dev)->gen >= 6 &&
> -	    tmp & RING_WAIT_SEMAPHORE &&
> -	    semaphore_passed(ring)) {
> -		DRM_ERROR("Kicking stuck semaphore on %s\n",
> -			  ring->name);
> -		I915_WRITE_CTL(ring, tmp);
> -		return false;
> +		return kick;
> +	}
> +
> +	if (INTEL_INFO(dev)->gen >= 6 && tmp & RING_WAIT_SEMAPHORE) {
> +		switch (semaphore_passed(ring)) {
> +		default:
> +			return hung;
> +		case 1:
> +			DRM_ERROR("Kicking stuck semaphore on %s\n",
> +				  ring->name);
> +			I915_WRITE_CTL(ring, tmp);
> +			return kick;
> +		case 0:
> +			return wait;
> +		}
>  	}
>  
> -	return true;
> +	return hung;
>  }
>  
>  /**
> @@ -2424,6 +2460,8 @@ void i915_hangcheck_elapsed(unsigned long data)
>  		u32 seqno, acthd;
>  		bool busy = true;
>  
> +		semaphore_clear_deadlocks(dev_priv);
> +
>  		seqno = ring->get_seqno(ring, false);
>  		acthd = intel_ring_get_active_head(ring);
>  
> @@ -2440,17 +2478,36 @@ void i915_hangcheck_elapsed(unsigned long data)
>  			} else {
>  				int score;
>  
> -				stuck[i] = ring->hangcheck.acthd == acthd;
> -				if (stuck[i]) {
> -					/* Every time we kick the ring, add a
> -					 * small increment to the hangcheck
> -					 * score so that we can catch a
> -					 * batch that is repeatedly kicked.
> -					 */
> -					score = ring_hung(ring) ? HUNG : KICK;
> -				} else
> +				/* We always increment the hangcheck score
> +				 * if the ring is busy and still processing
> +				 * the same request, so that no single request
> +				 * can run indefinitely (such as a chain of
> +				 * batches). The only time we do not increment
> +				 * the hangcheck score on this ring, if this
> +				 * ring is in a legitimate wait for another
> +				 * ring. In that case the waiting ring is a
> +				 * victim and we want to be sure we catch the
> +				 * right culprit. Then every time we do kick
> +				 * the ring, add a small increment to the
> +				 * score so that we can catch a batch that is
> +				 * being repeatedly kicked and so responsible
> +				 * for stalling the machine.
> +				 */
> +				switch (ring_stuck(ring, acthd)) {
> +				case wait:
> +					score = 0;
> +					break;
> +				case active:
>  					score = BUSY;
> -
> +					break;
> +				case kick:
> +					score = KICK;
> +					break;
> +				case hung:
> +					score = HUNG;
> +					stuck[i] = true;
> +					break;
> +				}
>  				ring->hangcheck.score += score;

I think extracting the score selection logic here would be nice, stuff is
falling off the cliff here a bit ;-)

Anyway, series merged, thanks a lot to everyone for digging into this and
coming up with a pretty neat solution.

Cheers, Daniel

>  			}
>  		} else {
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index efc403d..a3e9610 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -38,6 +38,7 @@ struct  intel_hw_status_page {
>  #define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
>  
>  struct intel_ring_hangcheck {
> +	bool deadlock;
>  	u32 seqno;
>  	u32 acthd;
>  	int score;
> -- 
> 1.7.10.4

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-11  9:45   ` Daniel Vetter
@ 2013-06-11 13:40     ` Chris Wilson
  2013-06-11 14:05       ` Daniel Vetter
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 13:40 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky

On Tue, Jun 11, 2013 at 11:45:00AM +0200, Daniel Vetter wrote:
> On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> > +		if (ring->hangcheck.seqno == seqno) {
> > +			if (ring_idle(ring, seqno)) {
> > +				if (waitqueue_active(&ring->irq_queue)) {
> > +					/* Issue a wake-up to catch stuck h/w. */
> > +					DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> > +						  ring->name);
> > +					wake_up_all(&ring->irq_queue);
> > +					ring->hangcheck.score += HUNG;
> 
> Not sure whether we want to hit missed interrupts this badly, it was
> rather common a while back ;-) But we can fine-tune this easily now, so
> no reservations for merging from my side.

Not sure what you mean here. The check is fairly easy and has gotten us
out of many a hole before, and makes for a good defense. So how would
you want to fine tune it?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-11 13:40     ` Chris Wilson
@ 2013-06-11 14:05       ` Daniel Vetter
  2013-06-11 14:16         ` Chris Wilson
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 14:05 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx, Ben Widawsky

On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> On Tue, Jun 11, 2013 at 11:45:00AM +0200, Daniel Vetter wrote:
> > On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> > > +		if (ring->hangcheck.seqno == seqno) {
> > > +			if (ring_idle(ring, seqno)) {
> > > +				if (waitqueue_active(&ring->irq_queue)) {
> > > +					/* Issue a wake-up to catch stuck h/w. */
> > > +					DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> > > +						  ring->name);
> > > +					wake_up_all(&ring->irq_queue);
> > > +					ring->hangcheck.score += HUNG;
> > 
> > > Not sure whether we want to hit missed interrupts this badly, it was
> > > rather common a while back ;-) But we can fine-tune this easily now, so
> > > no reservations for merging from my side.
> 
> Not sure what you mean here. The check is fairly easy and has gotten us
> out of many a hole before, and makes for a good defense. So how would
> you want to fine tune it?

Something like the MI_WAIT hangcheck score, but like I've said as long as
we don't have a real-world bug report (some poor guy disabled semaphores
maybe due to the snb issue?) not worth bothering at all.

I've just thought that if we're unlucky and miss the interrupt a few times
in a row we don't want to accidentally declare the gpu dead.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-11 14:05       ` Daniel Vetter
@ 2013-06-11 14:16         ` Chris Wilson
  2013-06-11 14:37           ` Daniel Vetter
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 14:16 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky

On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> > Not sure what you mean here. The check is fairly easy and has gotten us
> > out of many a hole before, and makes for a good defense. So how would
> > you want to fine tune it?
> 
> Something like the MI_WAIT hangcheck score, but like I've said as long as
> we don't have a real-world bug report (some poor guy disabled semaphores
> maybe due to the snb issue?) not worth bothering at all.
> 
> I've just thought that if we're unlucky and miss the interrupt a few times
> in a row we don't want to accidentally declare the gpu dead.

I regarded it as a driver bug, that a GPU reset would not help. So the
choice is between limping along with the hopefully occasional stall, or
terminating the GPU with extreme prejudice. I chose the former, hence
did not increment the hangcheck.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-11 14:16         ` Chris Wilson
@ 2013-06-11 14:37           ` Daniel Vetter
  2013-06-11 16:10             ` Chris Wilson
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 14:37 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx, Ben Widawsky

On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
>> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
>> > Not sure what you mean here. The check is fairly easy and has gotten us
>> > out of many a hole before, and makes for a good defense. So how would
>> > you want to fine tune it?
>>
>> Something like the MI_WAIT hangcheck score, but like I've said as long as
>> we don't have a real-world bug report (some poor guy disabled semaphores
>> maybe due to the snb issue?) not worth bothering at all.
>>
>> I've just thought that if we're unlucky and miss the interrupt a few times
>> in a row we don't want to accidentally declare the gpu dead.
>
> I regarded it as a driver bug, that a GPU reset would not help. So the
> choice is between limping along with the hopefully occasional stall, or
> terminating the GPU with extreme prejudice. I chose the former, hence
> did not increment the hangcheck.

Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
score now for a stuck, but idle ring? So pretty short of declaring the
thing dead? Ofc there's the slow decline if the gpu isn't actually
dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
times we'll eventually declare it dead despite the limping along.
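
As a rough check of that arithmetic, here is a small standalone simulation,
not driver code. It assumes the score drops by one on every hangcheck tick
where the ring made progress, that a stuck-but-idle tick adds HUNG, and that
the gpu is declared dead once the score exceeds a threshold FIRE (30 here).
FIRE, the decay step and ticks_until_fire() are assumptions for illustration.

```c
#include <assert.h>

enum { HUNG = 20, FIRE = 30 };	/* HUNG from this thread; FIRE is assumed */

/*
 * Simulate one stall every `period` hangcheck ticks, starting at tick 0.
 * Returns the tick at which the score first exceeds FIRE, or -1 if that
 * does not happen within max_ticks.
 */
static int ticks_until_fire(int period, int max_ticks)
{
	int score = 0;
	int t;

	assert(period > 0);

	for (t = 0; t < max_ticks; t++) {
		if (t % period == 0)
			score += HUNG;	/* stuck but idle: full penalty */
		else if (score > 0)
			score--;	/* progress: slow decline */

		if (score > FIRE)
			return t;
	}

	return -1;
}
```

Under these assumptions a stall every 20 ticks gains one point per period
and eventually crosses the threshold, while a stall every 21 ticks decays
back to zero in between and never does.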

Anyway nothing to really worry about, just wanted to check my
understanding here.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
  2013-06-11 14:37           ` Daniel Vetter
@ 2013-06-11 16:10             ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 16:10 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky

On Tue, Jun 11, 2013 at 04:37:26PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
> >> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> >> > Not sure what you mean here. The check is fairly easy and has gotten us
> >> > out of many a hole before, and makes for a good defense. So how would
> >> > you want to fine tune it?
> >>
> >> Something like the MI_WAIT hangcheck score, but like I've said as long as
> >> we don't have a real-world bug report (some poor guy disabled semaphores
> >> maybe due to the snb issue?) not worth bothering at all.
> >>
> >> I've just thought that if we're unlucky and miss the interrupt a few times
> >> in a row we don't want to accidentally declare the gpu dead.
> >
> > I regarded it as a driver bug, that a GPU reset would not help. So the
> > choice is between limping along with the hopefully occasional stall, or
> > terminating the GPU with extreme prejudice. I chose the former, hence
> > did not increment the hangcheck.
> 
> Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
> score now for a stuck, but idle ring? So pretty short of declaring the
> thing dead?

Yeah... Didn't mean to do that, as all the time I was thinking "don't
hang here, this is our bug not userspace's".

> Ofc there's the slow decline if the gpu isn't actually
> dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
> times we'll eventually declare it dead despite the limping along.
> 
> Anyway nothing to really worry about, just wanted to check my
> understanding here.

Looks like my fingers mutinied; and I am the one confused.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Thread overview: 12+ messages
2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
2013-06-11  9:45   ` Daniel Vetter
2013-06-11 13:40     ` Chris Wilson
2013-06-11 14:05       ` Daniel Vetter
2013-06-11 14:16         ` Chris Wilson
2013-06-11 14:37           ` Daniel Vetter
2013-06-11 16:10             ` Chris Wilson
2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
2013-06-11  9:51   ` Daniel Vetter
2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
2013-06-10 13:42   ` Mika Kuoppala
