* [PATCH 0/4] [RFC] use HW watchdog timer
@ 2012-07-16 18:51 Ben Widawsky
  2012-07-16 18:51 ` [PATCH 1/4] drm/i915: Use HW watchdog for each batch Ben Widawsky
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-16 18:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky

This was my pet project for the last few days, but I have to take a
break from working on it for now to do some real work ;-). The patches
compile, and pass a basic test, but that's about it. There is still
quite a bit of work left to make this useful. The easiest thing would be
to tie this into error state.

The idea is pretty simple. It uses the HW watchdog to set a timeout per
batchbuffer, instead of a global software watchdog.
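
Roughly, each submission brackets the batch with an arm/disarm of the
ring's watchdog from the command stream. Just a sketch of what patch 1
below does (LRI is shorthand for MI_LOAD_REGISTER_IMM; as I read the
register, writing 0 to CTL starts the counter and ~0 cancels it):

	/* arm: program the threshold, start the ring's watchdog counter */
	LRI(RING_WATCHDOG_THRESH(mmio_base), timeout_in_80ns_ticks);
	LRI(RING_WATCHDOG_CTL(mmio_base), 0);

	MI_BATCH_BUFFER_START ...

	/* disarm: reaching this point means the batch finished in time */
	LRI(RING_WATCHDOG_CTL(mmio_base), ~0);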

Pros:
* Potential for per-batch or per-ring watchdog values. I believe this is
particularly interesting when/if we get to GPGPU workloads.
* Batch granularity hang detection. This mostly just makes hang
detection and recovery a bit easier IMO.

Cons:
* Blit ring doesn't have an interrupt. This means we still need the
software watchdog, and it makes hang detection more complex. I've been
led to believe future HW *may* have this interrupt.
* Semaphores 

I'm looking for feedback, mainly from Daniel and Chris, on whether this
is worth pursuing further when I have more time. The idea would be to
eventually use this to implement much of the ARB robustness requirements
instead of doing a bunch of request list processing.

Ben Widawsky (4):
  drm/i915: Use HW watchdog for each batch
  drm/i915: Turn on watchdog interrupts
  drm/i915: Add a breadcrumb
  drm/i915: Display the failing seqno

 drivers/gpu/drm/i915/i915_drv.h         |  2 +-
 drivers/gpu/drm/i915/i915_irq.c         | 14 ++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  5 +++++
 drivers/gpu/drm/i915/intel_ringbuffer.c | 34 ++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/intel_ringbuffer.h |  3 +++
 5 files changed, 54 insertions(+), 4 deletions(-)

-- 
1.7.11.2

* [PATCH 1/4] drm/i915: Use HW watchdog for each batch
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
@ 2012-07-16 18:51 ` Ben Widawsky
  2012-07-16 18:51 ` [PATCH 2/4] drm/i915: Turn on watchdog interrupts Ben Widawsky
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-16 18:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

The HW watchdog exists for all the rings. It's just a register, but if
we write it from the command stream it has the effect we want of
generating an interrupt if a given batch takes too long.

Unfortunately, our hardware doesn't support interrupts on the blit ring.
We still need the software watchdog for that.
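
The threshold is programmed in units of the watchdog counter period
(80 ns here), and defaults to half of the (now 3000 ms) software
hangcheck period so the hardware fires well before hangcheck gives up.
Working through the macro below:

	DEFAULT_WATCHDOG_TIMEOUT = WATCHDOG_TIMEOUT(3000 / 2)
	                         = DIV_ROUND_UP(1500 * 1000000, 80)
	                         = 18750000 ticks (~1.5 s)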

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h         |  2 +-
 drivers/gpu/drm/i915/i915_reg.h         |  2 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c | 26 +++++++++++++++++++++++++-
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 627fe35..0a13b22 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -430,7 +430,7 @@ typedef struct drm_i915_private {
 	int num_pch_pll;
 
 	/* For hangcheck timer */
-#define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
+#define DRM_I915_HANGCHECK_PERIOD 3000 /* in ms */
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
 	uint32_t last_acthd[I915_NUM_RINGS];
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index cc82871..195154b 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -449,6 +449,8 @@
 #define RING_ACTHD(base)	((base)+0x74)
 #define RING_NOPID(base)	((base)+0x94)
 #define RING_IMR(base)		((base)+0xa8)
+#define RING_WATCHDOG_CTL(base) ((base)+0x178)
+#define RING_WATCHDOG_THRESH(base) ((base)+0x17c)
 #define   TAIL_ADDR		0x001FFFF8
 #define   HEAD_WRAP_COUNT	0xFFE00000
 #define   HEAD_WRAP_ONE		0x00200000
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index ddc4859..f61b4a0 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1038,6 +1038,9 @@ static int intel_init_ring_buffer(struct drm_device *dev,
 		goto err_unpin;
 	}
 
+	I915_WRITE(RING_WATCHDOG_CTL(ring->mmio_base), UINT_MAX);
+	I915_WRITE(RING_WATCHDOG_THRESH(ring->mmio_base), UINT_MAX);
+
 	ret = ring->init(ring);
 	if (ret)
 		goto err_unmap;
@@ -1318,19 +1321,40 @@ static int gen6_ring_flush(struct intel_ring_buffer *ring,
 	return 0;
 }
 
+#define PERIOD_NS	80
+#define WATCHDOG_TIMEOUT(timeout_ms) (DIV_ROUND_UP((timeout_ms) * NSEC_PER_MSEC, PERIOD_NS))
+/* this should be less than the i915 software watchdog */
+#define DEFAULT_WATCHDOG_TIMEOUT WATCHDOG_TIMEOUT(DRM_I915_HANGCHECK_PERIOD / 2)
+
 static int
 gen6_ring_dispatch_execbuffer(struct intel_ring_buffer *ring,
 			      u32 offset, u32 len)
 {
 	int ret;
 
-	ret = intel_ring_begin(ring, 2);
+	ret = intel_ring_begin(ring, 14);
 	if (ret)
 		return ret;
 
+	intel_ring_emit(ring, MI_NOOP);
+	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
+	intel_ring_emit(ring, RING_WATCHDOG_THRESH(ring->mmio_base));
+	intel_ring_emit(ring, DEFAULT_WATCHDOG_TIMEOUT);
+	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
+	intel_ring_emit(ring, RING_WATCHDOG_CTL(ring->mmio_base));
+	intel_ring_emit(ring, 0);
+
+	/* add breadcrumb here */
+
 	intel_ring_emit(ring, MI_BATCH_BUFFER_START | MI_BATCH_NON_SECURE_I965);
 	/* bit0-7 is the length on GEN6+ */
 	intel_ring_emit(ring, offset);
+
+	intel_ring_emit(ring, MI_NOOP);
+	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
+	intel_ring_emit(ring, RING_WATCHDOG_CTL(ring->mmio_base));
+	intel_ring_emit(ring, UINT_MAX);
+
 	intel_ring_advance(ring);
 
 	return 0;
-- 
1.7.11.2

* [PATCH 2/4] drm/i915: Turn on watchdog interrupts
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
  2012-07-16 18:51 ` [PATCH 1/4] drm/i915: Use HW watchdog for each batch Ben Widawsky
@ 2012-07-16 18:51 ` Ben Widawsky
  2012-07-16 18:51 ` [PATCH 3/4] drm/i915: Add a breadcrumb Ben Widawsky
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-16 18:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

Now that we are actually using the watchdog timers, enabling their
interrupts just makes sense.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c         | 8 ++++++++
 drivers/gpu/drm/i915/i915_reg.h         | 3 +++
 drivers/gpu/drm/i915/intel_ringbuffer.c | 4 ++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 566f61b..a376371 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -475,6 +475,12 @@ static void snb_gt_irq_handler(struct drm_device *dev,
 		i915_handle_error(dev, false);
 	}
 
+	if (gt_iir & (GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED))
+		DRM_ERROR("Render timeout expired\n");
+
+	if (gt_iir & (GT_GEN6_BSD_TIMEOUT_COUNTER_EXPIRED))
+		DRM_ERROR("BSD timeout expired\n");
+
 	if (gt_iir & GT_GEN7_L3_PARITY_ERROR_INTERRUPT)
 		ivybridge_handle_parity_error(dev);
 }
@@ -1816,6 +1822,8 @@ static int ironlake_irq_postinstall(struct drm_device *dev)
 	if (IS_GEN6(dev))
 		render_irqs =
 			GT_USER_INTERRUPT |
+			GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED |
+			GT_GEN6_BSD_TIMEOUT_COUNTER_EXPIRED |
 			GEN6_BSD_USER_INTERRUPT |
 			GEN6_BLITTER_USER_INTERRUPT;
 	else
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 195154b..558ed9a 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -699,6 +699,7 @@
 
 #define GEN6_BSD_HWSTAM			0x12098
 #define GEN6_BSD_IMR			0x120a8
+#define   GEN6_BSD_TIMEOUT_COUNTER_EXPIRED (1 << 18)
 #define   GEN6_BSD_USER_INTERRUPT	(1 << 12)
 
 #define GEN6_BSD_RNCID			0x12198
@@ -3346,8 +3347,10 @@
 #define GT_GEN6_BLT_FLUSHDW_NOTIFY_INTERRUPT	(1 << 26)
 #define GT_GEN6_BLT_CS_ERROR_INTERRUPT		(1 << 25)
 #define GT_GEN6_BLT_USER_INTERRUPT		(1 << 22)
+#define GT_GEN6_BSD_TIMEOUT_COUNTER_EXPIRED	(1 << 18)
 #define GT_GEN6_BSD_CS_ERROR_INTERRUPT		(1 << 15)
 #define GT_GEN6_BSD_USER_INTERRUPT		(1 << 12)
+#define GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED  (1 << 6)
 #define GT_BSD_USER_INTERRUPT			(1 << 5) /* ilk only */
 #define GT_GEN7_L3_PARITY_ERROR_INTERRUPT	(1 << 5)
 #define GT_PIPE_NOTIFY				(1 << 4)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f61b4a0..82bd0cd 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1397,7 +1397,7 @@ int intel_init_render_ring_buffer(struct drm_device *dev)
 		ring->flush = gen6_render_ring_flush;
 		ring->irq_get = gen6_ring_get_irq;
 		ring->irq_put = gen6_ring_put_irq;
-		ring->irq_enable_mask = GT_USER_INTERRUPT;
+		ring->irq_enable_mask = GT_USER_INTERRUPT | GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED;
 		ring->get_seqno = gen6_ring_get_seqno;
 		ring->sync_to = gen6_ring_sync;
 		ring->semaphore_register[0] = MI_SEMAPHORE_SYNC_INVALID;
@@ -1530,7 +1530,7 @@ int intel_init_bsd_ring_buffer(struct drm_device *dev)
 		ring->flush = gen6_ring_flush;
 		ring->add_request = gen6_add_request;
 		ring->get_seqno = gen6_ring_get_seqno;
-		ring->irq_enable_mask = GEN6_BSD_USER_INTERRUPT;
+		ring->irq_enable_mask = GEN6_BSD_USER_INTERRUPT | GEN6_BSD_TIMEOUT_COUNTER_EXPIRED;
 		ring->irq_get = gen6_ring_get_irq;
 		ring->irq_put = gen6_ring_put_irq;
 		ring->dispatch_execbuffer = gen6_ring_dispatch_execbuffer;
-- 
1.7.11.2

* [PATCH 3/4] drm/i915: Add a breadcrumb
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
  2012-07-16 18:51 ` [PATCH 1/4] drm/i915: Use HW watchdog for each batch Ben Widawsky
  2012-07-16 18:51 ` [PATCH 2/4] drm/i915: Turn on watchdog interrupts Ben Widawsky
@ 2012-07-16 18:51 ` Ben Widawsky
  2012-07-16 18:51 ` [PATCH 4/4] drm/i915: Display the failing seqno Ben Widawsky
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-16 18:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

Getting the watchdog interrupt is great, but knowing the exact batch the
watchdog fired on is even better. There are other ways to do this
without the watchdog, but having the watchdog makes it quite simple.

The idea is to simply emit the seqno before the batchbuffer begins
running; if the watchdog fires, we can read back the location where the
seqno was stored to identify the batchbuffer that hung.
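
Reading the crumb back is done in a later patch; mapping it to a request
(and eventually into error state) could then look roughly like the
sketch below. This is not part of the series, just an illustration using
the current request list fields:

	u32 hung = intel_read_status_page(ring,
				I915_GEM_WATCHDOG_CRUMB_RCS + ring->id);
	struct drm_i915_gem_request *request;

	list_for_each_entry(request, &ring->request_list, list) {
		if (request->seqno == hung) {
			/* request->file_priv identifies the client to
			 * blame; error state capture would go here. */
			break;
		}
	}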

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 6 +++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 82bd0cd..39c6f8e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1332,7 +1332,7 @@ gen6_ring_dispatch_execbuffer(struct intel_ring_buffer *ring,
 {
 	int ret;
 
-	ret = intel_ring_begin(ring, 14);
+	ret = intel_ring_begin(ring, 18);
 	if (ret)
 		return ret;
 
@@ -1345,6 +1345,10 @@ gen6_ring_dispatch_execbuffer(struct intel_ring_buffer *ring,
 	intel_ring_emit(ring, 0);
 
 	/* add breadcrumb here */
+	intel_ring_emit(ring, MI_STORE_DWORD_INDEX);
+	intel_ring_emit(ring, (I915_GEM_WATCHDOG_CRUMB_RCS + ring->id) << MI_STORE_DWORD_INDEX_SHIFT);
+	intel_ring_emit(ring, i915_gem_next_request_seqno(ring));
+	intel_ring_emit(ring, MI_NOOP);
 
 	intel_ring_emit(ring, MI_BATCH_BUFFER_START | MI_BATCH_NON_SECURE_I965);
 	/* bit0-7 is the length on GEN6+ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 1d3c81f..73de086 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -183,6 +183,9 @@ intel_read_status_page(struct intel_ring_buffer *ring,
  * The area from dword 0x20 to 0x3ff is available for driver usage.
  */
 #define I915_GEM_HWS_INDEX		0x20
+#define I915_GEM_WATCHDOG_CRUMB_RCS	0x21
+#define I915_GEM_WATCHDOG_CRUMB_VCS	0x22
+#define I915_GEM_WATCHDOG_CRUMB_BCS	0x23
 
 void intel_cleanup_ring_buffer(struct intel_ring_buffer *ring);
 
-- 
1.7.11.2

* [PATCH 4/4] drm/i915: Display the failing seqno
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
                   ` (2 preceding siblings ...)
  2012-07-16 18:51 ` [PATCH 3/4] drm/i915: Add a breadcrumb Ben Widawsky
@ 2012-07-16 18:51 ` Ben Widawsky
  2012-07-16 20:16 ` [PATCH 0/4] [RFC] use HW watchdog timer Daniel Vetter
  2012-07-17 11:12 ` Chris Wilson
  5 siblings, 0 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-16 18:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

We'd really want to do a lot more here, but this is just a demo for now.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a376371..864472ee 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -475,11 +475,17 @@ static void snb_gt_irq_handler(struct drm_device *dev,
 		i915_handle_error(dev, false);
 	}
 
-	if (gt_iir & (GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED))
-		DRM_ERROR("Render timeout expired\n");
+	if (gt_iir & (GT_GEN6_RENDER_TIMEOUT_COUNTER_EXPIRED)) {
+		u32 hung_seqno = intel_read_status_page(&dev_priv->ring[RCS],
+							I915_GEM_WATCHDOG_CRUMB_RCS);
+		DRM_ERROR("Render timeout expired (%u)\n", hung_seqno);
+	}
 
-	if (gt_iir & (GT_GEN6_BSD_TIMEOUT_COUNTER_EXPIRED))
-		DRM_ERROR("BSD timeout expired\n");
+	if (gt_iir & (GT_GEN6_BSD_TIMEOUT_COUNTER_EXPIRED)) {
+		u32 hung_seqno = intel_read_status_page(&dev_priv->ring[VCS],
+							I915_GEM_WATCHDOG_CRUMB_VCS);
+		DRM_ERROR("BSD timeout expired (%u)\n", hung_seqno);
+	}
 
 	if (gt_iir & GT_GEN7_L3_PARITY_ERROR_INTERRUPT)
 		ivybridge_handle_parity_error(dev);
-- 
1.7.11.2

* Re: [PATCH 0/4] [RFC] use HW watchdog timer
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
                   ` (3 preceding siblings ...)
  2012-07-16 18:51 ` [PATCH 4/4] drm/i915: Display the failing seqno Ben Widawsky
@ 2012-07-16 20:16 ` Daniel Vetter
  2012-07-17 11:12 ` Chris Wilson
  5 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2012-07-16 20:16 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Daniel Vetter, intel-gfx

On Mon, Jul 16, 2012 at 11:51:55AM -0700, Ben Widawsky wrote:
> This was my pet project for the last few days, but I have to take a
> break from working on it for now to do some real work ;-). The patches
> compile, and pass a basic test, but that's about it. There is still
> quite a bit of work left to make this useful. The easiest thing would be
> to tie this into error state.
> 
> The idea is pretty simple. It uses the HW watchdog to set a timeout per
> batchbuffer, instead of a global software watchdog.
> 
> Pros:
> * Potential for per batch, or ring watchdog values. I believe when/if we
> get to GPGPU workloads, this is particularly interesting.
> * Batch granularity hang detection. This mostly just makes hang
> detection and recovery a bit easier IMO.
> 
> Cons:
> * Blit ring doesn't have an interrupt. This means we still need the
> software watchdog, and it makes hang detection more complex. I've been
> led to believe future HW *may* have this interrupt.
> * Semaphores 
> 
> I'm looking for feedback, mainly for Daniel, and Chris if this is worth
> pursuing further when I have more time. The idea would be to eventually
> use this to implement much of the ARB robustness requirements instead of
> doing a bunch of request list processing.
> 
> Ben Widawsky (4):
>   drm/i915: Use HW watchdog for each batch
>   drm/i915: Turn on watchdog interrupts
>   drm/i915: Add a breadcrumb
>   drm/i915: Display the failing seqno

High-level drive-by review:

I think it's very important to separate hangs in the batch itself from
hangs in the ringbuffers, e.g. semaphore deadlocks. We should only blame
the former on the userspace-submitted batchbuffer. So on that ground I
think the following pseudo-code check in the hangcheck code would be
simpler to implement and more robust:

/* Check whether we're outside of the ring. At worst this check presumes
 * that the hang is in the ring, never the other way round. */
if (ACTHEAD != RING_HEAD)
	/* Check whether ACTHEAD lies in any active buffer. We can't check
	 * the object's gpu_domain, that might have been changed again
	 * already. */
	for_each_active_buffer(buffer)
		if (buffer->gtt_offset <= ACTHEAD &&
		    buffer->gtt_offset + buffer->size > ACTHEAD)
			goto found;

Otoh I don't think your patches here are a completely lost cause. This
render watchdog could be used rather well for a per-batch short timeout
to kill one specific batchbuffer, I guess. But right now I don't see an
immediate use-case ...

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

* Re: [PATCH 0/4] [RFC] use HW watchdog timer
  2012-07-16 18:51 [PATCH 0/4] [RFC] use HW watchdog timer Ben Widawsky
                   ` (4 preceding siblings ...)
  2012-07-16 20:16 ` [PATCH 0/4] [RFC] use HW watchdog timer Daniel Vetter
@ 2012-07-17 11:12 ` Chris Wilson
  2012-07-17 18:51   ` Ben Widawsky
  5 siblings, 1 reply; 8+ messages in thread
From: Chris Wilson @ 2012-07-17 11:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky

On Mon, 16 Jul 2012 11:51:55 -0700, Ben Widawsky <ben@bwidawsk.net> wrote:
> Pros:
> * Potential for per batch, or ring watchdog values. I believe when/if we
> get to GPGPU workloads, this is particularly interesting.
> * Batch granularity hang detection. This mostly just makes hang
> detection and recovery a bit easier IMO.
> 
> Cons:
> * Blit ring doesn't have an interrupt. This means we still need the
> software watchdog, and it makes hang detection more complex. I've been
> led to believe future HW *may* have this interrupt.
> * Semaphores 

Replacing the black magic for INSTDONE hang detection does seem like a
sensible plan, but as long as we require the hangcheck timer we are only
adding code complexity. So there really needs to be a compelling
advantage for the watchdog, something that we cannot achieve with the
existing method.

For me, the criteria is whether we ever miss a hang or falsely accuse
the hw of stopping. If I understand the watchdog correctly, it basically
ensures the batch completes within a certain interval which we can
codify into the existing hangcheck, so no USP.
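
To make "codify into the existing hangcheck" concrete, a hypothetical
sketch (BATCH_BUDGET_MS is made up, and this measures time since the
request was emitted, not time actually spent executing):

	/* per ring, from i915_hangcheck_elapsed() */
	struct drm_i915_gem_request *request;
	u32 completed = ring->get_seqno(ring);

	list_for_each_entry(request, &ring->request_list, list) {
		if (i915_seqno_passed(completed, request->seqno))
			continue;	/* already finished */
		/* oldest outstanding request: over budget? */
		if (time_after(jiffies, request->emitted_jiffies +
			       msecs_to_jiffies(BATCH_BUDGET_MS)))
			i915_handle_error(dev, true);
		break;
	}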

Or is there more magic waiting in the wings?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 0/4] [RFC] use HW watchdog timer
  2012-07-17 11:12 ` Chris Wilson
@ 2012-07-17 18:51   ` Ben Widawsky
  0 siblings, 0 replies; 8+ messages in thread
From: Ben Widawsky @ 2012-07-17 18:51 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, intel-gfx

On Tue, 17 Jul 2012 12:12:39 +0100
Chris Wilson <chris@chris-wilson.co.uk> wrote:

> On Mon, 16 Jul 2012 11:51:55 -0700, Ben Widawsky <ben@bwidawsk.net> wrote:
> > Pros:
> > * Potential for per batch, or ring watchdog values. I believe when/if we
> > get to GPGPU workloads, this is particularly interesting.
> > * Batch granularity hang detection. This mostly just makes hang
> > detection and recovery a bit easier IMO.
> > 
> > Cons:
> > * Blit ring doesn't have an interrupt. This means we still need the
> > software watchdog, and it makes hang detection more complex. I've been
> > led to believe future HW *may* have this interrupt.
> > * Semaphores 
> 
> Replacing the black magic for INSTDONE hang detection does seem like a
> sensible plan, but as long as we require the hangcheck timer we are only
> adding code complexity. So there really needs to be a compelling
> advantage for the watchdog, something that we cannot achieve with the
> existing method.

Just to be clear, INSTDONE can go away. I don't think it's valuable for
the blitter.

> 
> For me, the criteria is whether we ever miss a hang or falsely accuse
> the hw of stopping. If I understand the watchdog correctly, it basically
> ensures the batch completes within a certain interval which we can
> codify into the existing hangcheck, so no USP.

Yeah. If we follow the Windows model, I think we just tweak the value
until we find something "good", and always reset on the timeout instead
of doing instdone-foo.

> 
> Or is there more magic waiting in the wings?
> -Chris
> 

The magic was only a more straightforward way of finding the batch to
blame, and as I said on IRC, when I started I was planning to gut the
whole SW watchdog; that was the magic.

FWIW I think we may see the interrupt in future products, so it may
still be worth considering whether we want to move in this direction.

-- 
Ben Widawsky, Intel Open Source Technology Center
