All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
@ 2012-04-10 16:00 Chris Wilson
  2012-04-10 23:59 ` Ben Widawsky
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2012-04-10 16:00 UTC (permalink / raw)
  To: intel-gfx

On the first instance we just wish to kick the waiters and see if that
terminates the wait conditions. If it does not, then we do not want to
keep retrying without ever making any forward progress and becoming
stuck in a hangcheck loop.

Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_irq.c |   63 +++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8c9187f..bdeecf1 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1859,6 +1859,36 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	return false;
 }
 
+static bool i915_hangcheck_hung(struct drm_device *dev)
+{
+	drm_i915_private_t *dev_priv = dev->dev_private;
+
+	if (dev_priv->hangcheck_count++ > 1) {
+		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
+		i915_handle_error(dev, true);
+
+		if (!IS_GEN2(dev)) {
+			/* Is the chip hanging on a WAIT_FOR_EVENT?
+			 * If so we can simply poke the RB_WAIT bit
+			 * and break the hang. This should work on
+			 * all but the second generation chipsets.
+			 */
+			if (kick_ring(&dev_priv->ring[RCS]))
+				return false;
+
+			if (HAS_BSD(dev) && kick_ring(&dev_priv->ring[VCS]))
+				return false;
+
+			if (HAS_BLT(dev) && kick_ring(&dev_priv->ring[BCS]))
+				return false;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
 /**
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. The first time this is called we simply record
@@ -1879,9 +1909,14 @@ void i915_hangcheck_elapsed(unsigned long data)
 	if (i915_hangcheck_ring_idle(&dev_priv->ring[RCS], &err) &&
 	    i915_hangcheck_ring_idle(&dev_priv->ring[VCS], &err) &&
 	    i915_hangcheck_ring_idle(&dev_priv->ring[BCS], &err)) {
-		dev_priv->hangcheck_count = 0;
-		if (err)
+		if (err) {
+			if (i915_hangcheck_hung(dev))
+				return;
+
 			goto repeat;
+		}
+
+		dev_priv->hangcheck_count = 0;
 		return;
 	}
 
@@ -1903,30 +1938,8 @@ void i915_hangcheck_elapsed(unsigned long data)
 	    dev_priv->last_acthd_blt == acthd_blt &&
 	    dev_priv->last_instdone == instdone &&
 	    dev_priv->last_instdone1 == instdone1) {
-		if (dev_priv->hangcheck_count++ > 1) {
-			DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
-			i915_handle_error(dev, true);
-
-			if (!IS_GEN2(dev)) {
-				/* Is the chip hanging on a WAIT_FOR_EVENT?
-				 * If so we can simply poke the RB_WAIT bit
-				 * and break the hang. This should work on
-				 * all but the second generation chipsets.
-				 */
-				if (kick_ring(&dev_priv->ring[RCS]))
-					goto repeat;
-
-				if (HAS_BSD(dev) &&
-				    kick_ring(&dev_priv->ring[VCS]))
-					goto repeat;
-
-				if (HAS_BLT(dev) &&
-				    kick_ring(&dev_priv->ring[BCS]))
-					goto repeat;
-			}
-
+		if (i915_hangcheck_hung(dev))
 			return;
-		}
 	} else {
 		dev_priv->hangcheck_count = 0;
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
  2012-04-10 16:00 [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ Chris Wilson
@ 2012-04-10 23:59 ` Ben Widawsky
  2012-04-11  8:18   ` Chris Wilson
  0 siblings, 1 reply; 5+ messages in thread
From: Ben Widawsky @ 2012-04-10 23:59 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Tue, 10 Apr 2012 17:00:41 +0100
Chris Wilson <chris@chris-wilson.co.uk> wrote:

> On the first instance we just wish to kick the waiters and see if that
> terminates the wait conditions. If it does not, then we do not want to
> keep retrying without ever making any forward progress and becoming
> stuck in a hangcheck loop.
> 
> Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I'm still confused about the problem we are purportedly fixing.

This should happen if we've missed an irq (or the watchdog fired too
soon), and then fires again before the thread has actually woken up to
realize that is missed the first IRQ?

As for extract the kick_ring bit of code for core hangcheck_elapsed,
that looks fine. I just don't quite understand the exact problem this
solves, and can't envision how we hit this case it seems the patch will
fix.

Ben

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
  2012-04-10 23:59 ` Ben Widawsky
@ 2012-04-11  8:18   ` Chris Wilson
  2012-04-11 20:32     ` Ben Widawsky
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2012-04-11  8:18 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

On Tue, 10 Apr 2012 16:59:11 -0700, Ben Widawsky <ben@bwidawsk.net> wrote:
> On Tue, 10 Apr 2012 17:00:41 +0100
> Chris Wilson <chris@chris-wilson.co.uk> wrote:
> 
> > On the first instance we just wish to kick the waiters and see if that
> > terminates the wait conditions. If it does not, then we do not want to
> > keep retrying without ever making any forward progress and becoming
> > stuck in a hangcheck loop.
> > 
> > Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> I'm still confused about the problem we are purportedly fixing.
> 
> This should happen if we've missed an irq (or the watchdog fired too
> soon), and then fires again before the thread has actually woken up to
> realize that is missed the first IRQ?
> 
> As for extract the kick_ring bit of code for core hangcheck_elapsed,
> that looks fine. I just don't quite understand the exact problem this
> solves, and can't envision how we hit this case it seems the patch will
> fix.

Sure, just look at the bug report for the garbage we wrote into the
ringbuffers and how we ended up indefinite wait. This is not defense
against normal behaviour but the driver screwing up.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
  2012-04-11  8:18   ` Chris Wilson
@ 2012-04-11 20:32     ` Ben Widawsky
  2012-04-14 22:12       ` Daniel Vetter
  0 siblings, 1 reply; 5+ messages in thread
From: Ben Widawsky @ 2012-04-11 20:32 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Wed, 11 Apr 2012 09:18:15 +0100
Chris Wilson <chris@chris-wilson.co.uk> wrote:

> On Tue, 10 Apr 2012 16:59:11 -0700, Ben Widawsky <ben@bwidawsk.net> wrote:
> > On Tue, 10 Apr 2012 17:00:41 +0100
> > Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > 
> > > On the first instance we just wish to kick the waiters and see if that
> > > terminates the wait conditions. If it does not, then we do not want to
> > > keep retrying without ever making any forward progress and becoming
> > > stuck in a hangcheck loop.
> > > 
> > > Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > I'm still confused about the problem we are purportedly fixing.
> > 
> > This should happen if we've missed an irq (or the watchdog fired too
> > soon), and then fires again before the thread has actually woken up to
> > realize that is missed the first IRQ?
> > 
> > As for extract the kick_ring bit of code for core hangcheck_elapsed,
> > that looks fine. I just don't quite understand the exact problem this
> > solves, and can't envision how we hit this case it seems the patch will
> > fix.
> 
> Sure, just look at the bug report for the garbage we wrote into the
> ringbuffers and how we ended up indefinite wait. This is not defense
> against normal behaviour but the driver screwing up.
> -Chris
> 

In that case this is
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

Though I am still pretty surprised that we have even seen this :|

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ
  2012-04-11 20:32     ` Ben Widawsky
@ 2012-04-14 22:12       ` Daniel Vetter
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Vetter @ 2012-04-14 22:12 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

On Wed, Apr 11, 2012 at 01:32:55PM -0700, Ben Widawsky wrote:
> On Wed, 11 Apr 2012 09:18:15 +0100
> Chris Wilson <chris@chris-wilson.co.uk> wrote:
> 
> > On Tue, 10 Apr 2012 16:59:11 -0700, Ben Widawsky <ben@bwidawsk.net> wrote:
> > > On Tue, 10 Apr 2012 17:00:41 +0100
> > > Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > 
> > > > On the first instance we just wish to kick the waiters and see if that
> > > > terminates the wait conditions. If it does not, then we do not want to
> > > > keep retrying without ever making any forward progress and becoming
> > > > stuck in a hangcheck loop.
> > > > 
> > > > Reported-and-tested-by: Lukas Hejtmanek <xhejtman@fi.muni.cz>
> > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48209
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > 
> > > I'm still confused about the problem we are purportedly fixing.
> > > 
> > > This should happen if we've missed an irq (or the watchdog fired too
> > > soon), and then fires again before the thread has actually woken up to
> > > realize that is missed the first IRQ?
> > > 
> > > As for extract the kick_ring bit of code for core hangcheck_elapsed,
> > > that looks fine. I just don't quite understand the exact problem this
> > > solves, and can't envision how we hit this case it seems the patch will
> > > fix.
> > 
> > Sure, just look at the bug report for the garbage we wrote into the
> > ringbuffers and how we ended up indefinite wait. This is not defense
> > against normal behaviour but the driver screwing up.
> > -Chris
> > 
> 
> In that case this is
> Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
> 
> Though I am still pretty surprised that we have even seen this :|

Queued for -next, thanks for the patch.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2012-04-14 22:11 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-10 16:00 [PATCH] drm/i915: Trigger hangcheck if we detect more a repeating missed IRQ Chris Wilson
2012-04-10 23:59 ` Ben Widawsky
2012-04-11  8:18   ` Chris Wilson
2012-04-11 20:32     ` Ben Widawsky
2012-04-14 22:12       ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.