* [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init
@ 2013-06-10 10:20 Chris Wilson
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC (permalink / raw)
To: intel-gfx
When we reset and restart a ring, we also want to clear any existing
hangcheck.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
drivers/gpu/drm/i915/intel_ringbuffer.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 1ef081c..a3cfa35 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -453,6 +453,8 @@ static int init_ring_common(struct intel_ring_buffer *ring)
ring->last_retired_head = -1;
}
+ memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+
out:
if (HAS_FORCE_WAKE(dev))
gen6_gt_force_wake_put(dev_priv);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
2013-06-11 9:45 ` Daniel Vetter
2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC (permalink / raw)
To: intel-gfx; +Cc: Ben Widawsky
After kicking a ring, it should be free to make progress again and so
should not be accused of being stuck until hangcheck fires once more. In
order to catch a denial-of-service within a batch or across multiple
batches, we still do increment the hangcheck score - just not as
severely so that it takes multiple kicks to fail.
This should address part of Ben's justified criticism of
commit 05407ff889ceebe383aa5907219f86582ef96b72
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Thu May 30 09:04:29 2013 +0300
drm/i915: detect hang using per ring hangcheck_score
"There's also another corner case on the kick. If the seqno = 2
(though not stuck), and on the 3rd hangcheck, the ring is stuck, and
we try to kick it... we don't actually try to find out if the kick
helped."
v2: Make sure we catch DoS attempts with batches full of invalid WAITs.
v3: Preserve the ability to detect loops by always charging the ring
if it is busy on the same request.
v4: Make sure we queue another check if on a new batch
References: https://bugs.freedesktop.org/show_bug.cgi?id=65394
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/i915/i915_irq.c | 110 +++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 52 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index dcb5209..32b2465 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2324,21 +2324,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
struct drm_i915_gem_request, list)->seqno;
}
-static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
- u32 ring_seqno, bool *err)
-{
- if (list_empty(&ring->request_list) ||
- i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
- /* Issue a wake-up to catch stuck h/w. */
- if (waitqueue_active(&ring->irq_queue)) {
- DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
- ring->name);
- wake_up_all(&ring->irq_queue);
- *err = true;
- }
- return true;
- }
- return false;
+static bool
+ring_idle(struct intel_ring_buffer *ring, u32 seqno)
+{
+ return (list_empty(&ring->request_list) ||
+ i915_seqno_passed(seqno, ring_last_seqno(ring)));
}
static bool semaphore_passed(struct intel_ring_buffer *ring)
@@ -2372,16 +2362,26 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
ioread32(ring->virtual_start+acthd+4)+1);
}
-static bool kick_ring(struct intel_ring_buffer *ring)
+static bool ring_hung(struct intel_ring_buffer *ring)
{
struct drm_device *dev = ring->dev;
struct drm_i915_private *dev_priv = dev->dev_private;
- u32 tmp = I915_READ_CTL(ring);
+ u32 tmp;
+
+ if (IS_GEN2(dev))
+ return true;
+
+ /* Is the chip hanging on a WAIT_FOR_EVENT?
+ * If so we can simply poke the RB_WAIT bit
+ * and break the hang. This should work on
+ * all but the second generation chipsets.
+ */
+ tmp = I915_READ_CTL(ring);
if (tmp & RING_WAIT) {
DRM_ERROR("Kicking stuck wait on %s\n",
ring->name);
I915_WRITE_CTL(ring, tmp);
- return true;
+ return false;
}
if (INTEL_INFO(dev)->gen >= 6 &&
@@ -2390,22 +2390,10 @@ static bool kick_ring(struct intel_ring_buffer *ring)
DRM_ERROR("Kicking stuck semaphore on %s\n",
ring->name);
I915_WRITE_CTL(ring, tmp);
- return true;
- }
- return false;
-}
-
-static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
-{
- if (IS_GEN2(ring->dev))
return false;
+ }
- /* Is the chip hanging on a WAIT_FOR_EVENT?
- * If so we can simply poke the RB_WAIT bit
- * and break the hang. This should work on
- * all but the second generation chipsets.
- */
- return !kick_ring(ring);
+ return true;
}
/**
@@ -2423,45 +2411,63 @@ void i915_hangcheck_elapsed(unsigned long data)
struct intel_ring_buffer *ring;
int i;
int busy_count = 0, rings_hung = 0;
- bool stuck[I915_NUM_RINGS];
+ bool stuck[I915_NUM_RINGS] = { 0 };
+#define BUSY 1
+#define KICK 5
+#define HUNG 20
+#define FIRE 30
if (!i915_enable_hangcheck)
return;
for_each_ring(ring, dev_priv, i) {
u32 seqno, acthd;
- bool idle, err = false;
+ bool busy = true;
seqno = ring->get_seqno(ring, false);
acthd = intel_ring_get_active_head(ring);
- idle = i915_hangcheck_ring_idle(ring, seqno, &err);
- stuck[i] = ring->hangcheck.acthd == acthd;
-
- if (idle) {
- if (err)
- ring->hangcheck.score += 2;
- else
- ring->hangcheck.score = 0;
- } else {
- busy_count++;
- if (ring->hangcheck.seqno == seqno) {
- ring->hangcheck.score++;
-
- /* Kick ring if stuck*/
- if (stuck[i])
- i915_hangcheck_ring_hung(ring);
+ if (ring->hangcheck.seqno == seqno) {
+ if (ring_idle(ring, seqno)) {
+ if (waitqueue_active(&ring->irq_queue)) {
+ /* Issue a wake-up to catch stuck h/w. */
+ DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
+ ring->name);
+ wake_up_all(&ring->irq_queue);
+ ring->hangcheck.score += HUNG;
+ } else
+ busy = false;
} else {
- ring->hangcheck.score = 0;
+ int score;
+
+ stuck[i] = ring->hangcheck.acthd == acthd;
+ if (stuck[i]) {
+ /* Every time we kick the ring, add a
+ * small increment to the hangcheck
+ * score so that we can catch a
+ * batch that is repeatedly kicked.
+ */
+ score = ring_hung(ring) ? HUNG : KICK;
+ } else
+ score = BUSY;
+
+ ring->hangcheck.score += score;
}
+ } else {
+ /* Gradually reduce the count so that we catch DoS
+ * attempts across multiple batches.
+ */
+ if (ring->hangcheck.score > 0)
+ ring->hangcheck.score--;
}
ring->hangcheck.seqno = seqno;
ring->hangcheck.acthd = acthd;
+ busy_count += busy;
}
for_each_ring(ring, dev_priv, i) {
- if (ring->hangcheck.score > 2) {
+ if (ring->hangcheck.score > FIRE) {
rings_hung++;
DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
stuck[i] ? "stuck" : "no progress",
--
1.7.10.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring
2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
2013-06-11 9:51 ` Daniel Vetter
2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC (permalink / raw)
To: intel-gfx; +Cc: Ben Widawsky
If we detect a ring is in a valid wait for another, just let it be.
Eventually it will either begin to progress again, or the entire system
will come grinding to a halt and then hangcheck will fire as soon as the
deadlock is detected.
This error was foretold by Ben in
commit 05407ff889ceebe383aa5907219f86582ef96b72
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Thu May 30 09:04:29 2013 +0300
drm/i915: detect hang using per ring hangcheck_score
"If ring B is waiting on ring A via semaphore, and ring A is making
progress, albeit slowly - the hangcheck will fire. The check will
determine that A is moving, however ring B will appear hung because
the ACTHD doesn't move. I honestly can't say if that's actually a
realistic problem to hit it probably implies the timeout value is too
low."
v2: Make sure we don't even incur the KICK cost whilst waiting.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=65394
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
---
drivers/gpu/drm/i915/i915_irq.c | 121 +++++++++++++++++++++++--------
drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
2 files changed, 90 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32b2465..cf8584c 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2331,21 +2331,21 @@ ring_idle(struct intel_ring_buffer *ring, u32 seqno)
i915_seqno_passed(seqno, ring_last_seqno(ring)));
}
-static bool semaphore_passed(struct intel_ring_buffer *ring)
+static struct intel_ring_buffer *
+semaphore_waits_for(struct intel_ring_buffer *ring, u32 *seqno)
{
struct drm_i915_private *dev_priv = ring->dev->dev_private;
- u32 acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
- struct intel_ring_buffer *signaller;
- u32 cmd, ipehr, acthd_min;
+ u32 cmd, ipehr, acthd, acthd_min;
ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
if ((ipehr & ~(0x3 << 16)) !=
(MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER))
- return false;
+ return NULL;
/* ACTHD is likely pointing to the dword after the actual command,
* so scan backwards until we find the MBOX.
*/
+ acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
acthd_min = max((int)acthd - 3 * 4, 0);
do {
cmd = ioread32(ring->virtual_start + acthd);
@@ -2354,22 +2354,53 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
acthd -= 4;
if (acthd < acthd_min)
- return false;
+ return NULL;
} while (1);
- signaller = &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
- return i915_seqno_passed(signaller->get_seqno(signaller, false),
- ioread32(ring->virtual_start+acthd+4)+1);
+ *seqno = ioread32(ring->virtual_start+acthd+4)+1;
+ return &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
+}
+
+static int semaphore_passed(struct intel_ring_buffer *ring)
+{
+ struct drm_i915_private *dev_priv = ring->dev->dev_private;
+ struct intel_ring_buffer *signaller;
+ u32 seqno, ctl;
+
+ ring->hangcheck.deadlock = true;
+
+ signaller = semaphore_waits_for(ring, &seqno);
+ if (signaller == NULL || signaller->hangcheck.deadlock)
+ return -1;
+
+ /* cursory check for an unkickable deadlock */
+ ctl = I915_READ_CTL(signaller);
+ if (ctl & RING_WAIT_SEMAPHORE && semaphore_passed(signaller) < 0)
+ return -1;
+
+ return i915_seqno_passed(signaller->get_seqno(signaller, false), seqno);
+}
+
+static void semaphore_clear_deadlocks(struct drm_i915_private *dev_priv)
+{
+ struct intel_ring_buffer *ring;
+ int i;
+
+ for_each_ring(ring, dev_priv, i)
+ ring->hangcheck.deadlock = false;
}
-static bool ring_hung(struct intel_ring_buffer *ring)
+static enum { wait, active, kick, hung } ring_stuck(struct intel_ring_buffer *ring, u32 acthd)
{
struct drm_device *dev = ring->dev;
struct drm_i915_private *dev_priv = dev->dev_private;
u32 tmp;
+ if (ring->hangcheck.acthd != acthd)
+ return active;
+
if (IS_GEN2(dev))
- return true;
+ return hung;
/* Is the chip hanging on a WAIT_FOR_EVENT?
* If so we can simply poke the RB_WAIT bit
@@ -2381,19 +2412,24 @@ static bool ring_hung(struct intel_ring_buffer *ring)
DRM_ERROR("Kicking stuck wait on %s\n",
ring->name);
I915_WRITE_CTL(ring, tmp);
- return false;
- }
-
- if (INTEL_INFO(dev)->gen >= 6 &&
- tmp & RING_WAIT_SEMAPHORE &&
- semaphore_passed(ring)) {
- DRM_ERROR("Kicking stuck semaphore on %s\n",
- ring->name);
- I915_WRITE_CTL(ring, tmp);
- return false;
+ return kick;
+ }
+
+ if (INTEL_INFO(dev)->gen >= 6 && tmp & RING_WAIT_SEMAPHORE) {
+ switch (semaphore_passed(ring)) {
+ default:
+ return hung;
+ case 1:
+ DRM_ERROR("Kicking stuck semaphore on %s\n",
+ ring->name);
+ I915_WRITE_CTL(ring, tmp);
+ return kick;
+ case 0:
+ return wait;
+ }
}
- return true;
+ return hung;
}
/**
@@ -2424,6 +2460,8 @@ void i915_hangcheck_elapsed(unsigned long data)
u32 seqno, acthd;
bool busy = true;
+ semaphore_clear_deadlocks(dev_priv);
+
seqno = ring->get_seqno(ring, false);
acthd = intel_ring_get_active_head(ring);
@@ -2440,17 +2478,36 @@ void i915_hangcheck_elapsed(unsigned long data)
} else {
int score;
- stuck[i] = ring->hangcheck.acthd == acthd;
- if (stuck[i]) {
- /* Every time we kick the ring, add a
- * small increment to the hangcheck
- * score so that we can catch a
- * batch that is repeatedly kicked.
- */
- score = ring_hung(ring) ? HUNG : KICK;
- } else
+ /* We always increment the hangcheck score
+ * if the ring is busy and still processing
+ * the same request, so that no single request
+ * can run indefinitely (such as a chain of
+ * batches). The only time we do not increment
+ * the hangcheck score on this ring, if this
+ * ring is in a legitimate wait for another
+ * ring. In that case the waiting ring is a
+ * victim and we want to be sure we catch the
+ * right culprit. Then every time we do kick
+ * the ring, add a small increment to the
+ * score so that we can catch a batch that is
+ * being repeatedly kicked and so responsible
+ * for stalling the machine.
+ */
+ switch (ring_stuck(ring, acthd)) {
+ case wait:
+ score = 0;
+ break;
+ case active:
score = BUSY;
-
+ break;
+ case kick:
+ score = KICK;
+ break;
+ case hung:
+ score = HUNG;
+ stuck[i] = true;
+ break;
+ }
ring->hangcheck.score += score;
}
} else {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index efc403d..a3e9610 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -38,6 +38,7 @@ struct intel_hw_status_page {
#define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
struct intel_ring_hangcheck {
+ bool deadlock;
u32 seqno;
u32 acthd;
int score;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning
2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
@ 2013-06-10 10:20 ` Chris Wilson
2013-06-10 13:42 ` Mika Kuoppala
2 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-10 10:20 UTC (permalink / raw)
To: intel-gfx
This is of no value to the developer reading the report, let alone the
bamboozled user.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/i915_irq.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index cf8584c..39730b6 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2525,12 +2525,10 @@ void i915_hangcheck_elapsed(unsigned long data)
for_each_ring(ring, dev_priv, i) {
if (ring->hangcheck.score > FIRE) {
- rings_hung++;
- DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
+ DRM_ERROR("%s on %s ring\n",
stuck[i] ? "stuck" : "no progress",
- stuck[i] ? "addr" : "seqno",
- stuck[i] ? ring->hangcheck.acthd & HEAD_ADDR :
- ring->hangcheck.seqno);
+ ring->name);
+ rings_hung++;
}
}
--
1.7.10.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning
2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
@ 2013-06-10 13:42 ` Mika Kuoppala
0 siblings, 0 replies; 12+ messages in thread
From: Mika Kuoppala @ 2013-06-10 13:42 UTC (permalink / raw)
To: Chris Wilson, intel-gfx
Chris Wilson <chris@chris-wilson.co.uk> writes:
> This is of no value to the developer reading the report, let alone the
> bamboozled user.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/i915_irq.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index cf8584c..39730b6 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2525,12 +2525,10 @@ void i915_hangcheck_elapsed(unsigned long data)
>
> for_each_ring(ring, dev_priv, i) {
> if (ring->hangcheck.score > FIRE) {
> - rings_hung++;
> - DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
> + DRM_ERROR("%s on %s ring\n",
^^
Noticed one redudant 'ring' in here as ring->name already contains it.
Patches 1-4:
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
@ 2013-06-11 9:45 ` Daniel Vetter
2013-06-11 13:40 ` Chris Wilson
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 9:45 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx, Ben Widawsky
On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> After kicking a ring, it should be free to make progress again and so
> should not be accused of being stuck until hangcheck fires once more. In
> order to catch a denial-of-service within a batch or across multiple
> batches, we still do increment the hangcheck score - just not as
> severely so that it takes multiple kicks to fail.
>
> This should address part of Ben's justified criticism of
>
> commit 05407ff889ceebe383aa5907219f86582ef96b72
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date: Thu May 30 09:04:29 2013 +0300
>
> drm/i915: detect hang using per ring hangcheck_score
>
> "There's also another corner case on the kick. If the seqno = 2
> (though not stuck), and on the 3rd hangcheck, the ring is stuck, and
> we try to kick it... we don't actually try to find out if the kick
> helped."
>
> v2: Make sure we catch DoS attempts with batches full of invalid WAITs.
> v3: Preserve the ability to detect loops by always charging the ring
> if it is busy on the same request.
> v4: Make sure we queue another check if on a new batch
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=65394
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> ---
> drivers/gpu/drm/i915/i915_irq.c | 110 +++++++++++++++++++++------------------
> 1 file changed, 58 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index dcb5209..32b2465 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2324,21 +2324,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
> struct drm_i915_gem_request, list)->seqno;
> }
>
> -static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
> - u32 ring_seqno, bool *err)
> -{
> - if (list_empty(&ring->request_list) ||
> - i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
> - /* Issue a wake-up to catch stuck h/w. */
> - if (waitqueue_active(&ring->irq_queue)) {
> - DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> - ring->name);
> - wake_up_all(&ring->irq_queue);
> - *err = true;
> - }
> - return true;
> - }
> - return false;
> +static bool
> +ring_idle(struct intel_ring_buffer *ring, u32 seqno)
> +{
> + return (list_empty(&ring->request_list) ||
> + i915_seqno_passed(seqno, ring_last_seqno(ring)));
> }
>
> static bool semaphore_passed(struct intel_ring_buffer *ring)
> @@ -2372,16 +2362,26 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
> ioread32(ring->virtual_start+acthd+4)+1);
> }
>
> -static bool kick_ring(struct intel_ring_buffer *ring)
> +static bool ring_hung(struct intel_ring_buffer *ring)
> {
> struct drm_device *dev = ring->dev;
> struct drm_i915_private *dev_priv = dev->dev_private;
> - u32 tmp = I915_READ_CTL(ring);
> + u32 tmp;
> +
> + if (IS_GEN2(dev))
> + return true;
> +
> + /* Is the chip hanging on a WAIT_FOR_EVENT?
> + * If so we can simply poke the RB_WAIT bit
> + * and break the hang. This should work on
> + * all but the second generation chipsets.
> + */
> + tmp = I915_READ_CTL(ring);
> if (tmp & RING_WAIT) {
> DRM_ERROR("Kicking stuck wait on %s\n",
> ring->name);
> I915_WRITE_CTL(ring, tmp);
> - return true;
> + return false;
> }
>
> if (INTEL_INFO(dev)->gen >= 6 &&
> @@ -2390,22 +2390,10 @@ static bool kick_ring(struct intel_ring_buffer *ring)
> DRM_ERROR("Kicking stuck semaphore on %s\n",
> ring->name);
> I915_WRITE_CTL(ring, tmp);
> - return true;
> - }
> - return false;
> -}
> -
> -static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
> -{
> - if (IS_GEN2(ring->dev))
> return false;
> + }
>
> - /* Is the chip hanging on a WAIT_FOR_EVENT?
> - * If so we can simply poke the RB_WAIT bit
> - * and break the hang. This should work on
> - * all but the second generation chipsets.
> - */
> - return !kick_ring(ring);
> + return true;
> }
>
> /**
> @@ -2423,45 +2411,63 @@ void i915_hangcheck_elapsed(unsigned long data)
> struct intel_ring_buffer *ring;
> int i;
> int busy_count = 0, rings_hung = 0;
> - bool stuck[I915_NUM_RINGS];
> + bool stuck[I915_NUM_RINGS] = { 0 };
> +#define BUSY 1
> +#define KICK 5
> +#define HUNG 20
> +#define FIRE 30
>
> if (!i915_enable_hangcheck)
> return;
>
> for_each_ring(ring, dev_priv, i) {
> u32 seqno, acthd;
> - bool idle, err = false;
> + bool busy = true;
>
> seqno = ring->get_seqno(ring, false);
> acthd = intel_ring_get_active_head(ring);
> - idle = i915_hangcheck_ring_idle(ring, seqno, &err);
> - stuck[i] = ring->hangcheck.acthd == acthd;
> -
> - if (idle) {
> - if (err)
> - ring->hangcheck.score += 2;
> - else
> - ring->hangcheck.score = 0;
> - } else {
> - busy_count++;
>
> - if (ring->hangcheck.seqno == seqno) {
> - ring->hangcheck.score++;
> -
> - /* Kick ring if stuck*/
> - if (stuck[i])
> - i915_hangcheck_ring_hung(ring);
> + if (ring->hangcheck.seqno == seqno) {
> + if (ring_idle(ring, seqno)) {
> + if (waitqueue_active(&ring->irq_queue)) {
> + /* Issue a wake-up to catch stuck h/w. */
> + DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> + ring->name);
> + wake_up_all(&ring->irq_queue);
> + ring->hangcheck.score += HUNG;
Not sure whether we want to hit missed interrupts this badly, it was
rather common a while back ;-) But we can fine-tune this easily now, so
now reservations for merging from my side.
-Daniel
> + } else
> + busy = false;
> } else {
> - ring->hangcheck.score = 0;
> + int score;
> +
> + stuck[i] = ring->hangcheck.acthd == acthd;
> + if (stuck[i]) {
> + /* Every time we kick the ring, add a
> + * small increment to the hangcheck
> + * score so that we can catch a
> + * batch that is repeatedly kicked.
> + */
> + score = ring_hung(ring) ? HUNG : KICK;
> + } else
> + score = BUSY;
> +
> + ring->hangcheck.score += score;
> }
> + } else {
> + /* Gradually reduce the count so that we catch DoS
> + * attempts across multiple batches.
> + */
> + if (ring->hangcheck.score > 0)
> + ring->hangcheck.score--;
> }
>
> ring->hangcheck.seqno = seqno;
> ring->hangcheck.acthd = acthd;
> + busy_count += busy;
> }
>
> for_each_ring(ring, dev_priv, i) {
> - if (ring->hangcheck.score > 2) {
> + if (ring->hangcheck.score > FIRE) {
> rings_hung++;
> DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
> stuck[i] ? "stuck" : "no progress",
> --
> 1.7.10.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring
2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
@ 2013-06-11 9:51 ` Daniel Vetter
0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 9:51 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx, Ben Widawsky
On Mon, Jun 10, 2013 at 11:20:21AM +0100, Chris Wilson wrote:
> If we detect a ring is in a valid wait for another, just let it be.
> Eventually it will either begin to progress again, or the entire system
> will come grinding to a halt and then hangcheck will fire as soon as the
> deadlock is detected.
>
> This error was foretold by Ben in
> commit 05407ff889ceebe383aa5907219f86582ef96b72
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date: Thu May 30 09:04:29 2013 +0300
>
> drm/i915: detect hang using per ring hangcheck_score
>
> "If ring B is waiting on ring A via semaphore, and ring A is making
> progress, albeit slowly - the hangcheck will fire. The check will
> determine that A is moving, however ring B will appear hung because
> the ACTHD doesn't move. I honestly can't say if that's actually a
> realistic problem to hit it probably implies the timeout value is too
> low."
>
> v2: Make sure we don't even incur the KICK cost whilst waiting.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=65394
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> ---
> drivers/gpu/drm/i915/i915_irq.c | 121 +++++++++++++++++++++++--------
> drivers/gpu/drm/i915/intel_ringbuffer.h | 1 +
> 2 files changed, 90 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 32b2465..cf8584c 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2331,21 +2331,21 @@ ring_idle(struct intel_ring_buffer *ring, u32 seqno)
> i915_seqno_passed(seqno, ring_last_seqno(ring)));
> }
>
> -static bool semaphore_passed(struct intel_ring_buffer *ring)
> +static struct intel_ring_buffer *
> +semaphore_waits_for(struct intel_ring_buffer *ring, u32 *seqno)
> {
> struct drm_i915_private *dev_priv = ring->dev->dev_private;
> - u32 acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
> - struct intel_ring_buffer *signaller;
> - u32 cmd, ipehr, acthd_min;
> + u32 cmd, ipehr, acthd, acthd_min;
>
> ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
> if ((ipehr & ~(0x3 << 16)) !=
> (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER))
> - return false;
> + return NULL;
>
> /* ACTHD is likely pointing to the dword after the actual command,
> * so scan backwards until we find the MBOX.
> */
> + acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
> acthd_min = max((int)acthd - 3 * 4, 0);
> do {
> cmd = ioread32(ring->virtual_start + acthd);
> @@ -2354,22 +2354,53 @@ static bool semaphore_passed(struct intel_ring_buffer *ring)
>
> acthd -= 4;
> if (acthd < acthd_min)
> - return false;
> + return NULL;
> } while (1);
>
> - signaller = &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
> - return i915_seqno_passed(signaller->get_seqno(signaller, false),
> - ioread32(ring->virtual_start+acthd+4)+1);
> + *seqno = ioread32(ring->virtual_start+acthd+4)+1;
> + return &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
> +}
> +
> +static int semaphore_passed(struct intel_ring_buffer *ring)
> +{
> + struct drm_i915_private *dev_priv = ring->dev->dev_private;
> + struct intel_ring_buffer *signaller;
> + u32 seqno, ctl;
> +
> + ring->hangcheck.deadlock = true;
> +
> + signaller = semaphore_waits_for(ring, &seqno);
> + if (signaller == NULL || signaller->hangcheck.deadlock)
> + return -1;
> +
> + /* cursory check for an unkickable deadlock */
> + ctl = I915_READ_CTL(signaller);
> + if (ctl & RING_WAIT_SEMAPHORE && semaphore_passed(signaller) < 0)
> + return -1;
> +
> + return i915_seqno_passed(signaller->get_seqno(signaller, false), seqno);
> +}
> +
> +static void semaphore_clear_deadlocks(struct drm_i915_private *dev_priv)
> +{
> + struct intel_ring_buffer *ring;
> + int i;
> +
> + for_each_ring(ring, dev_priv, i)
> + ring->hangcheck.deadlock = false;
> }
>
> -static bool ring_hung(struct intel_ring_buffer *ring)
> +static enum { wait, active, kick, hung } ring_stuck(struct intel_ring_buffer *ring, u32 acthd)
> {
> struct drm_device *dev = ring->dev;
> struct drm_i915_private *dev_priv = dev->dev_private;
> u32 tmp;
>
> + if (ring->hangcheck.acthd != acthd)
> + return active;
> +
> if (IS_GEN2(dev))
> - return true;
> + return hung;
>
> /* Is the chip hanging on a WAIT_FOR_EVENT?
> * If so we can simply poke the RB_WAIT bit
> @@ -2381,19 +2412,24 @@ static bool ring_hung(struct intel_ring_buffer *ring)
> DRM_ERROR("Kicking stuck wait on %s\n",
> ring->name);
> I915_WRITE_CTL(ring, tmp);
> - return false;
> - }
> -
> - if (INTEL_INFO(dev)->gen >= 6 &&
> - tmp & RING_WAIT_SEMAPHORE &&
> - semaphore_passed(ring)) {
> - DRM_ERROR("Kicking stuck semaphore on %s\n",
> - ring->name);
> - I915_WRITE_CTL(ring, tmp);
> - return false;
> + return kick;
> + }
> +
> + if (INTEL_INFO(dev)->gen >= 6 && tmp & RING_WAIT_SEMAPHORE) {
> + switch (semaphore_passed(ring)) {
> + default:
> + return hung;
> + case 1:
> + DRM_ERROR("Kicking stuck semaphore on %s\n",
> + ring->name);
> + I915_WRITE_CTL(ring, tmp);
> + return kick;
> + case 0:
> + return wait;
> + }
> }
>
> - return true;
> + return hung;
> }
>
> /**
> @@ -2424,6 +2460,8 @@ void i915_hangcheck_elapsed(unsigned long data)
> u32 seqno, acthd;
> bool busy = true;
>
> + semaphore_clear_deadlocks(dev_priv);
> +
> seqno = ring->get_seqno(ring, false);
> acthd = intel_ring_get_active_head(ring);
>
> @@ -2440,17 +2478,36 @@ void i915_hangcheck_elapsed(unsigned long data)
> } else {
> int score;
>
> - stuck[i] = ring->hangcheck.acthd == acthd;
> - if (stuck[i]) {
> - /* Every time we kick the ring, add a
> - * small increment to the hangcheck
> - * score so that we can catch a
> - * batch that is repeatedly kicked.
> - */
> - score = ring_hung(ring) ? HUNG : KICK;
> - } else
> + /* We always increment the hangcheck score
> + * if the ring is busy and still processing
> + * the same request, so that no single request
> + * can run indefinitely (such as a chain of
> + * batches). The only time we do not increment
> + * the hangcheck score on this ring, if this
> + * ring is in a legitimate wait for another
> + * ring. In that case the waiting ring is a
> + * victim and we want to be sure we catch the
> + * right culprit. Then every time we do kick
> + * the ring, add a small increment to the
> + * score so that we can catch a batch that is
> + * being repeatedly kicked and so responsible
> + * for stalling the machine.
> + */
> + switch (ring_stuck(ring, acthd)) {
> + case wait:
> + score = 0;
> + break;
> + case active:
> score = BUSY;
> -
> + break;
> + case kick:
> + score = KICK;
> + break;
> + case hung:
> + score = HUNG;
> + stuck[i] = true;
> + break;
> + }
> ring->hangcheck.score += score;
I think extracting the score selection logic here would be nice, stuff is
falling of the cliff here a bit ;-)
Anyway, series merged, thanks a lot to everyone for digging into this an
coming up with a pretty neat solution.
Cheers, Daniel
> }
> } else {
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index efc403d..a3e9610 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -38,6 +38,7 @@ struct intel_hw_status_page {
> #define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
>
> struct intel_ring_hangcheck {
> + bool deadlock;
> u32 seqno;
> u32 acthd;
> int score;
> --
> 1.7.10.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-11 9:45 ` Daniel Vetter
@ 2013-06-11 13:40 ` Chris Wilson
2013-06-11 14:05 ` Daniel Vetter
0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 13:40 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky
On Tue, Jun 11, 2013 at 11:45:00AM +0200, Daniel Vetter wrote:
> On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> > + if (ring->hangcheck.seqno == seqno) {
> > + if (ring_idle(ring, seqno)) {
> > + if (waitqueue_active(&ring->irq_queue)) {
> > + /* Issue a wake-up to catch stuck h/w. */
> > + DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> > + ring->name);
> > + wake_up_all(&ring->irq_queue);
> > + ring->hangcheck.score += HUNG;
>
> Not sure whether we want to hit missed interrupts this badly, it was
> rather common a while back ;-) But we can fine-tune this easily now, so
> now reservations for merging from my side.
Not sure what you mean here. The check is fairly easy and has gotten us
out of many a hole before, and makes for a good defense. So how would
you want to fine tune it?
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-11 13:40 ` Chris Wilson
@ 2013-06-11 14:05 ` Daniel Vetter
2013-06-11 14:16 ` Chris Wilson
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 14:05 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, intel-gfx, Ben Widawsky
On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> On Tue, Jun 11, 2013 at 11:45:00AM +0200, Daniel Vetter wrote:
> > On Mon, Jun 10, 2013 at 11:20:20AM +0100, Chris Wilson wrote:
> > > + if (ring->hangcheck.seqno == seqno) {
> > > + if (ring_idle(ring, seqno)) {
> > > + if (waitqueue_active(&ring->irq_queue)) {
> > > + /* Issue a wake-up to catch stuck h/w. */
> > > + DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> > > + ring->name);
> > > + wake_up_all(&ring->irq_queue);
> > > + ring->hangcheck.score += HUNG;
> >
> > Not sure whether we want to hit missed interrupts this badly, it was
> > rather common a while back ;-) But we can fine-tune this easily now, so
> > now reservations for merging from my side.
>
> Not sure what you mean here. The check is fairly easy and has gotten us
> out of many a hole before, and makes for a good defense. So how would
> you want to fine tune it?
Something like the MI_WAIT hangcheck score, but like I've said as long as
we don't have a real-world bug report (some poor guy disabled semaphores
maybe due to the snb issue?) not worth bothering at all.
I've just thought that if we're unlucky and miss the interrupt a few times
in a row we don't want to accidentally declare the gpu dead.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-11 14:05 ` Daniel Vetter
@ 2013-06-11 14:16 ` Chris Wilson
2013-06-11 14:37 ` Daniel Vetter
0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 14:16 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky
On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> > Not sure what you mean here. The check is fairly easy and has gotten us
> > out of many a hole before, and makes for a good defense. So how would
> > you want to fine tune it?
>
> Something like the MI_WAIT hangcheck score, but like I've said as long as
> we don't have a real-world bug report (some poor guy disabled semaphores
> maybe due to the snb issue?) not worth bothering at all.
>
> I've just thought that if we're unlucky and miss the interrupt a few times
> in a row we don't want to accidentally declare the gpu dead.
I regarded it as a driver bug, that a GPU reset would not help. So the
choice is between limping along with the hopefully occasional stall, or
terminating the GPU with extreme prejudice. I chose the former, hence
did not increment the hangcheck.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-11 14:16 ` Chris Wilson
@ 2013-06-11 14:37 ` Daniel Vetter
2013-06-11 16:10 ` Chris Wilson
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-06-11 14:37 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, intel-gfx, Ben Widawsky
On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
>> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
>> > Not sure what you mean here. The check is fairly easy and has gotten us
>> > out of many a hole before, and makes for a good defense. So how would
>> > you want to fine tune it?
>>
>> Something like the MI_WAIT hangcheck score, but like I've said as long as
>> we don't have a real-world bug report (some poor guy disabled semaphores
>> maybe due to the snb issue?) not worth bothering at all.
>>
>> I've just thought that if we're unlucky and miss the interrupt a few times
>> in a row we don't want to accidentally declare the gpu dead.
>
> I regarded it as a driver bug, that a GPU reset would not help. So the
> choice is between limping along with the hopefully occasional stall, or
> terminating the GPU with extreme prejudice. I chose the former, hence
> did not increment the hangcheck.
Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
score now for a stuck, but idle ring? So pretty short of declaring the
thing dead? Ofc there's the slow decline if the gpu isn't actually
dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
times we'll eventually declare it dead despite the limping along.
Anyway nothing to really worry about, just wanted to check my
understanding here.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring
2013-06-11 14:37 ` Daniel Vetter
@ 2013-06-11 16:10 ` Chris Wilson
0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2013-06-11 16:10 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, Ben Widawsky
On Tue, Jun 11, 2013 at 04:37:26PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 4:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Tue, Jun 11, 2013 at 04:05:41PM +0200, Daniel Vetter wrote:
> >> On Tue, Jun 11, 2013 at 02:40:19PM +0100, Chris Wilson wrote:
> >> > Not sure what you mean here. The check is fairly easy and has gotten us
> >> > out of many a hole before, and makes for a good defense. So how would
> >> > you want to fine tune it?
> >>
> >> Something like the MI_WAIT hangcheck score, but like I've said as long as
> >> we don't have a real-world bug report (some poor guy disabled semaphores
> >> maybe due to the snb issue?) not worth bothering at all.
> >>
> >> I've just thought that if we're unlucky and miss the interrupt a few times
> >> in a row we don't want to accidentally declare the gpu dead.
> >
> > I regarded it as a driver bug, that a GPU reset would not help. So the
> > choice is between limping along with the hopefully occasional stall, or
> > terminating the GPU with extreme prejudice. I chose the former, hence
> > did not increment the hangcheck.
>
> Hm, maybe I'm reading the logic wrongly, but don't we add a += HUNG
> score now for a stuck, but idle ring? So pretty short of declaring the
> thing dead?
Yeah... Didn't mean to do that, as all the time I was thinking "don't
hang here, this is our bug not userspace's".
> Ofc there's the slow decline if the gpu isn't actually
> dead, but if we have more than 1 such stall every HUNG (=20) hangcheck
> times we'll eventually declare it dead despite the limping along.
>
> Anyway nothing to really worry about, just wanted to check my
> understanding here.
Looks like my fingers mutinied; and I am the one confused.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-06-11 16:10 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-10 10:20 [PATCH 1/4] drm/i915: Initialize ring->hangcheck upon ring init Chris Wilson
2013-06-10 10:20 ` [PATCH 2/4] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring Chris Wilson
2013-06-11 9:45 ` Daniel Vetter
2013-06-11 13:40 ` Chris Wilson
2013-06-11 14:05 ` Daniel Vetter
2013-06-11 14:16 ` Chris Wilson
2013-06-11 14:37 ` Daniel Vetter
2013-06-11 16:10 ` Chris Wilson
2013-06-10 10:20 ` [PATCH 3/4] drm/i915: Don't count semaphore waits towards a stuck ring Chris Wilson
2013-06-11 9:51 ` Daniel Vetter
2013-06-10 10:20 ` [PATCH 4/4] drm/i915: Eliminate the addr/seqno from the hangcheck warning Chris Wilson
2013-06-10 13:42 ` Mika Kuoppala
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.