* [PATCH 00/13] arb robustness enablers
@ 2013-02-26 11:05 Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 01/13] drm/i915: add context parameter to i915_switch_context() Mika Kuoppala
                   ` (12 more replies)
  0 siblings, 13 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

Hi,

Here is a revised patchset for guilty context detection. I have
tried to take into account all the feedback received on the
RFC series posted earlier.

The context refcounting (kref) patches come first in the series,
and the seqno-based hangcheck triggering has been split into
multiple smaller patches.

-Mika

Mika Kuoppala (13):
  drm/i915: add context parameter to i915_switch_context()
  drm/i915: reference count for i915_hw_contexts
  drm/i915: pass seqno to i915_hangcheck_ring_idle
  drm/i915: track ring progression using seqnos
  drm/i915: introduce i915_hangcheck_ring_hung
  drm/i915: detect hang using per ring hangcheck_score
  drm/i915: remove i915_hangcheck_hung
  drm/i915: add struct ctx_reset_state
  drm/i915: add reset_state for hw_contexts
  drm/i915: mark rings which were waiting when hang happened
  drm/i915: add batch object and context to i915_add_request()
  drm/i915: find guilty batch buffer on ring resets
  drm/i915: refuse to submit more batchbuffers from guilty context

 drivers/gpu/drm/i915/i915_dma.c            |    2 +-
 drivers/gpu/drm/i915/i915_drv.c            |   23 +++++-
 drivers/gpu/drm/i915/i915_drv.h            |   40 +++++++--
 drivers/gpu/drm/i915/i915_gem.c            |  123 ++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_gem_context.c    |   90 ++++++++++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   23 +++++-
 drivers/gpu/drm/i915/i915_irq.c            |  121 +++++++++++++--------------
 drivers/gpu/drm/i915/intel_overlay.c       |    5 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h    |    4 +
 10 files changed, 335 insertions(+), 98 deletions(-)

-- 
1.7.9.5

* [PATCH 01/13] drm/i915: add context parameter to i915_switch_context()
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 18:31   ` Ben Widawsky
  2013-02-26 11:05 ` [PATCH 02/13] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

In preparation for the next commit, return the context that
was switched to from i915_switch_context().

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    3 ++-
 drivers/gpu/drm/i915/i915_gem.c            |    2 +-
 drivers/gpu/drm/i915/i915_gem_context.c    |   36 ++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    2 +-
 4 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e95337c..27459ec 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1671,7 +1671,8 @@ void i915_gem_context_init(struct drm_device *dev);
 void i915_gem_context_fini(struct drm_device *dev);
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 int i915_switch_context(struct intel_ring_buffer *ring,
-			struct drm_file *file, int to_id);
+			struct drm_file *file, int to_id,
+			struct i915_hw_context **ctx);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8413ffc..61610f3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2517,7 +2517,7 @@ int i915_gpu_idle(struct drm_device *dev)
 
 	/* Flush everything onto the inactive list. */
 	for_each_ring(ring, dev_priv, i) {
-		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
+		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID, NULL);
 		if (ret)
 			return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 21177d9..4125919 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -428,10 +428,9 @@ static int do_switch(struct i915_hw_context *to)
 /**
  * i915_switch_context() - perform a GPU context switch.
  * @ring: ring for which we'll execute the context switch
- * @file_priv: file_priv associated with the context, may be NULL
- * @id: context id number
- * @seqno: sequence number by which the new context will be switched to
- * @flags:
+ * @file: file associated with the context, may be NULL
+ * @to_id: context id number
+ * @ctx: i915_hw_context we switched to, may be NULL
  *
  * The context life cycle is simple. The context refcount is incremented and
  * decremented by 1 and create and destroy. If the context is in use by the GPU,
@@ -440,29 +439,40 @@ static int do_switch(struct i915_hw_context *to)
  */
 int i915_switch_context(struct intel_ring_buffer *ring,
 			struct drm_file *file,
-			int to_id)
+			int to_id,
+			struct i915_hw_context **ctx)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct i915_hw_context *to;
+	struct i915_hw_context *to = NULL;
+	int ret = 0;
 
 	if (dev_priv->hw_contexts_disabled)
-		return 0;
+		goto out;
 
 	if (ring != &dev_priv->ring[RCS])
-		return 0;
+		goto out;
 
 	if (to_id == DEFAULT_CONTEXT_ID) {
 		to = ring->default_context;
 	} else {
-		if (file == NULL)
-			return -EINVAL;
+		if (file == NULL) {
+			ret = -EINVAL;
+			goto out;
+		}
 
 		to = i915_gem_context_get(file->driver_priv, to_id);
-		if (to == NULL)
-			return -ENOENT;
+		if (to == NULL) {
+			ret = -ENOENT;
+			goto out;
+		}
 	}
 
-	return do_switch(to);
+	ret = do_switch(to);
+out:
+	if (ctx)
+		*ctx = to;
+
+	return ret;
 }
 
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 2f2daeb..499d025 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1017,7 +1017,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
-	ret = i915_switch_context(ring, file, ctx_id);
+	ret = i915_switch_context(ring, file, ctx_id, NULL);
 	if (ret)
 		goto err;
 
-- 
1.7.9.5

* [PATCH 02/13] drm/i915: reference count for i915_hw_contexts
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 01/13] drm/i915: add context parameter to i915_switch_context() Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 21:45   ` Ben Widawsky
  2013-02-26 11:05 ` [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

In preparation for analysing which context was
guilty of a gpu hang, store a kref'd context pointer
in the request struct.

This allows us to inspect contexts when the gpu is reset,
even if those contexts have already been released
by userspace.
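
For clarity, the intended reference lifecycle added here is roughly
the following (an informal sketch, not part of the diff):

  create_hw_context()            kref_init(&ctx->ref)          /* ref = 1 */
  i915_add_request()             kref_get(&request->ctx->ref)  /* request pins ctx */
  destroy ioctl / file close     do_release() + kref_put()     /* userspace ref dropped */
  request retire / ring reset    kref_put()                    /* last put frees ctx */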

v2: track i915_hw_context pointers instead of using ctx_ids
    (from Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    8 +++++++-
 drivers/gpu/drm/i915/i915_gem.c            |   20 +++++++++++++++++---
 drivers/gpu/drm/i915/i915_gem_context.c    |   22 ++++++++++++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   10 ++++++----
 drivers/gpu/drm/i915/intel_overlay.c       |    4 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 6 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 27459ec..9f1a75d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -437,6 +437,7 @@ struct i915_hw_ppgtt {
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
 struct i915_hw_context {
+	struct kref ref;
 	int id;
 	bool is_initialized;
 	struct drm_i915_file_private *file_priv;
@@ -1240,6 +1241,9 @@ struct drm_i915_gem_request {
 	/** Postion in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Context related to this request */
+	struct i915_hw_context *ctx;
+
 	/** Time at which this request was emitted, in jiffies. */
 	unsigned long emitted_jiffies;
 
@@ -1629,7 +1633,8 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
 int i915_add_request(struct intel_ring_buffer *ring,
 		     struct drm_file *file,
-		     u32 *seqno);
+		     u32 *seqno,
+		     struct i915_hw_context *ctx);
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -1673,6 +1678,7 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 int i915_switch_context(struct intel_ring_buffer *ring,
 			struct drm_file *file, int to_id,
 			struct i915_hw_context **ctx);
+void i915_gem_context_free(struct kref *ctx_ref);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 61610f3..850b1bb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -967,7 +967,7 @@ i915_gem_check_olr(struct intel_ring_buffer *ring, u32 seqno)
 
 	ret = 0;
 	if (seqno == ring->outstanding_lazy_request)
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL, NULL, NULL);
 
 	return ret;
 }
@@ -1997,7 +1997,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 int
 i915_add_request(struct intel_ring_buffer *ring,
 		 struct drm_file *file,
-		 u32 *out_seqno)
+		 u32 *out_seqno,
+		 struct i915_hw_context *ctx)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2037,6 +2038,11 @@ i915_add_request(struct intel_ring_buffer *ring,
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
 	request->tail = request_ring_position;
+	request->ctx = ctx;
+
+	if (request->ctx)
+		kref_get(&request->ctx->ref);
+
 	request->emitted_jiffies = jiffies;
 	was_empty = list_empty(&ring->request_list);
 	list_add_tail(&request->list, &ring->request_list);
@@ -2101,6 +2107,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
@@ -2195,6 +2205,10 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
@@ -2262,7 +2276,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
 		if (ring->gpu_caches_dirty)
-			i915_add_request(ring, NULL, NULL);
+			i915_add_request(ring, NULL, NULL, NULL);
 
 		idle &= list_empty(&ring->request_list);
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 4125919..cba54fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -124,12 +124,24 @@ static int get_context_size(struct drm_device *dev)
 	return ret;
 }
 
-static void do_destroy(struct i915_hw_context *ctx)
+void i915_gem_context_free(struct kref *ctx_ref)
+{
+	struct i915_hw_context *ctx = container_of(ctx_ref,
+						   typeof(*ctx), ref);
+	kfree(ctx);
+}
+
+static void do_release(struct i915_hw_context *ctx)
 {
 	if (ctx->file_priv)
 		idr_remove(&ctx->file_priv->context_idr, ctx->id);
 
 	drm_gem_object_unreference(&ctx->obj->base);
+}
+
+static void do_destroy(struct i915_hw_context *ctx)
+{
+	do_release(ctx);
 	kfree(ctx);
 }
 
@@ -145,6 +157,7 @@ create_hw_context(struct drm_device *dev,
 	if (ctx == NULL)
 		return ERR_PTR(-ENOMEM);
 
+	kref_init(&ctx->ref);
 	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
 	if (ctx->obj == NULL) {
 		kfree(ctx);
@@ -286,8 +299,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 
 	BUG_ON(id == DEFAULT_CONTEXT_ID);
 
-	do_destroy(ctx);
-
+	do_release(ctx);
+	kref_put(&ctx->ref, i915_gem_context_free);
 	return 0;
 }
 
@@ -526,7 +539,8 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
-	do_destroy(ctx);
+	do_release(ctx);
+	kref_put(&ctx->ref, i915_gem_context_free);
 
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 499d025..b1e54d5 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -789,13 +789,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
 static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
-				    struct intel_ring_buffer *ring)
+				    struct intel_ring_buffer *ring,
+				    struct i915_hw_context *ctx)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_add_request(ring, file, NULL);
+	(void)i915_add_request(ring, file, NULL, ctx);
 }
 
 static int
@@ -834,6 +835,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_i915_gem_object *batch_obj;
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
+	struct i915_hw_context *ctx;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1017,7 +1019,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
-	ret = i915_switch_context(ring, file, ctx_id, NULL);
+	ret = i915_switch_context(ring, file, ctx_id, &ctx);
 	if (ret)
 		goto err;
 
@@ -1068,7 +1070,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
 
 err:
 	eb_destroy(eb);
diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
index 67a2501..540be47 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -217,7 +217,7 @@ static int intel_overlay_do_wait_request(struct intel_overlay *overlay,
 	int ret;
 
 	BUG_ON(overlay->last_flip_req);
-	ret = i915_add_request(ring, NULL, &overlay->last_flip_req);
+	ret = i915_add_request(ring, NULL, &overlay->last_flip_req, NULL);
 	if (ret)
 		return ret;
 
@@ -286,7 +286,7 @@ static int intel_overlay_continue(struct intel_overlay *overlay,
 	intel_ring_emit(ring, flip_addr);
 	intel_ring_advance(ring);
 
-	return i915_add_request(ring, NULL, &overlay->last_flip_req);
+	return i915_add_request(ring, NULL, &overlay->last_flip_req, NULL);
 }
 
 static void intel_overlay_release_old_vid_tail(struct intel_overlay *overlay)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 1d5d613..559d3e5 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1423,7 +1423,7 @@ int intel_ring_idle(struct intel_ring_buffer *ring)
 
 	/* We need to add any requests required to flush the objects and ring */
 	if (ring->outstanding_lazy_request) {
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL, NULL, NULL);
 		if (ret)
 			return ret;
 	}
-- 
1.7.9.5

* [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 01/13] drm/i915: add context parameter to i915_switch_context() Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 02/13] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 22:41   ` Ben Widawsky
  2013-02-26 11:05 ` [PATCH 04/13] drm/i915: track ring progression using seqnos Mika Kuoppala
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

In preparation for the next commit, pass the seqno as a parameter
to i915_hangcheck_ring_idle, as it will also be used inside
i915_hangcheck_elapsed.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 29037e0..4f60c87 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1740,11 +1740,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
 			  struct drm_i915_gem_request, list)->seqno;
 }
 
-static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
+static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
+				     u32 ring_seqno, bool *err)
 {
 	if (list_empty(&ring->request_list) ||
-	    i915_seqno_passed(ring->get_seqno(ring, false),
-			      ring_last_seqno(ring))) {
+	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
 		/* Issue a wake-up to catch stuck h/w. */
 		if (waitqueue_active(&ring->irq_queue)) {
 			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
@@ -1821,7 +1821,10 @@ void i915_hangcheck_elapsed(unsigned long data)
 	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-	    idle &= i915_hangcheck_ring_idle(ring, &err);
+		u32 seqno;
+
+		seqno = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
 	    acthd[i] = intel_ring_get_active_head(ring);
 	}
 
-- 
1.7.9.5

* [PATCH 04/13] drm/i915: track ring progression using seqnos
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (2 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 14:12   ` Chris Wilson
  2013-02-26 11:05 ` [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

Instead of relying on acthd, track ring seqno progression
to detect whether a ring has hung.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 --
 drivers/gpu/drm/i915/i915_irq.c         |   30 +++++++++++++-----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    2 ++
 3 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9f1a75d..fb51b4f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -797,8 +797,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
-	uint32_t last_acthd[I915_NUM_RINGS];
-	uint32_t prev_instdone[I915_NUM_INSTDONE_REG];
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4f60c87..de5af12 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1810,22 +1810,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	uint32_t acthd[I915_NUM_RINGS], instdone[I915_NUM_INSTDONE_REG];
 	struct intel_ring_buffer *ring;
 	bool err = false, idle;
 	int i;
+	u32 seqno[I915_NUM_RINGS];
+	bool work_done;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		u32 seqno;
-
-		seqno = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
-	    acthd[i] = intel_ring_get_active_head(ring);
+		seqno[i] = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
 	}
 
 	/* If all work is done then ACTHD clearly hasn't advanced. */
@@ -1841,20 +1838,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 		return;
 	}
 
-	i915_get_extra_instdone(dev, instdone);
-	if (memcmp(dev_priv->gpu_error.last_acthd, acthd,
-		   sizeof(acthd)) == 0 &&
-	    memcmp(dev_priv->gpu_error.prev_instdone, instdone,
-		   sizeof(instdone)) == 0) {
+	work_done = false;
+	for_each_ring(ring, dev_priv, i) {
+		if (ring->hangcheck_seqno != seqno[i]) {
+			work_done = true;
+			ring->hangcheck_seqno = seqno[i];
+		}
+	}
+
+	if (!work_done) {
 		if (i915_hangcheck_hung(dev))
 			return;
 	} else {
 		dev_priv->gpu_error.hangcheck_count = 0;
-
-		memcpy(dev_priv->gpu_error.last_acthd, acthd,
-		       sizeof(acthd));
-		memcpy(dev_priv->gpu_error.prev_instdone, instdone,
-		       sizeof(instdone));
 	}
 
 repeat:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d66208c..9599c56 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -137,6 +137,8 @@ struct  intel_ring_buffer {
 	struct i915_hw_context *default_context;
 	struct drm_i915_gem_object *last_context_obj;
 
+	u32 hangcheck_seqno;
+
 	void *private;
 };
 
-- 
1.7.9.5

* [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (3 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 04/13] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 23:03   ` Ben Widawsky
  2013-02-26 11:05 ` [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

In preparation for tracking per-ring progress in hangcheck,
add i915_hangcheck_ring_hung.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index de5af12..b828807 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1771,28 +1771,35 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	return false;
 }
 
+static bool i915_hangcheck_ring_hung(struct drm_device *dev,
+				     struct intel_ring_buffer *ring)
+{
+	if (!IS_GEN2(dev)) {
+		/* Is the chip hanging on a WAIT_FOR_EVENT?
+		 * If so we can simply poke the RB_WAIT bit
+		 * and break the hang. This should work on
+		 * all but the second generation chipsets.
+		 */
+		return !kick_ring(ring);
+	}
+
+	return false;
+}
+
 static bool i915_hangcheck_hung(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 
 	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
 		bool hung = true;
+		struct intel_ring_buffer *ring;
+		int i;
 
 		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
 		i915_handle_error(dev, true);
 
-		if (!IS_GEN2(dev)) {
-			struct intel_ring_buffer *ring;
-			int i;
-
-			/* Is the chip hanging on a WAIT_FOR_EVENT?
-			 * If so we can simply poke the RB_WAIT bit
-			 * and break the hang. This should work on
-			 * all but the second generation chipsets.
-			 */
-			for_each_ring(ring, dev_priv, i)
-				hung &= !kick_ring(ring);
-		}
+		for_each_ring(ring, dev_priv, i)
+			hung &= i915_hangcheck_ring_hung(dev, ring);
 
 		return hung;
 	}
-- 
1.7.9.5

* [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (4 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 14:16   ` Chris Wilson
  2013-02-26 11:05 ` [PATCH 07/13] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

Add a per-ring score for the possible culprit of a gpu hang. If
a ring is busy and not waiting, it will accumulate the highest score
across calls to i915_hangcheck_elapsed. This way we are
most likely to pick the ring that actually caused the hang
rather than one that was merely waiting.
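
Roughly, the per-ring scoring below works out as follows per
hangcheck tick (an informal summary of the code in this patch):

  ring idle, but someone is stuck waiting on it   -> score++
  ring idle otherwise                             -> score = 0
  ring busy, seqno advanced                       -> score = 0
  ring busy, seqno stuck, ring is waiting         -> score++
  ring busy, seqno stuck, ring is not waiting     -> score += 2

A ring whose score climbs above 2 is then reported as hung.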

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |   65 +++++++++++++++++--------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index b828807..4da8691 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -356,7 +356,6 @@ static void notify_ring(struct drm_device *dev,
 
 	wake_up_all(&ring->irq_queue);
 	if (i915_enable_hangcheck) {
-		dev_priv->gpu_error.hangcheck_count = 0;
 		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
 			  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
 	}
@@ -1818,52 +1817,58 @@ void i915_hangcheck_elapsed(unsigned long data)
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct intel_ring_buffer *ring;
-	bool err = false, idle;
 	int i;
-	u32 seqno[I915_NUM_RINGS];
-	bool work_done;
+	int busy_count = 0, rings_hung = 0;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		seqno[i] = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
-	}
+		u32 seqno;
+		bool idle, err = false;
+
+		seqno = ring->get_seqno(ring, false);
+		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
 
-	/* If all work is done then ACTHD clearly hasn't advanced. */
-	if (idle) {
-		if (err) {
-			if (i915_hangcheck_hung(dev))
-				return;
+		if (idle) {
+			if (err)
+				ring->hangcheck_score++;
+			else
+				ring->hangcheck_score = 0;
+		} else {
+			busy_count++;
 
-			goto repeat;
+			if (ring->hangcheck_seqno == seqno) {
+				ring->hangcheck_score++;
+
+				/* If the ring is not waiting, raise
+				   the score further */
+				if (i915_hangcheck_ring_hung(dev, ring))
+					ring->hangcheck_score++;
+			} else {
+				ring->hangcheck_score = 0;
+			}
 		}
 
-		dev_priv->gpu_error.hangcheck_count = 0;
-		return;
+		ring->hangcheck_seqno = seqno;
 	}
 
-	work_done = false;
 	for_each_ring(ring, dev_priv, i) {
-		if (ring->hangcheck_seqno != seqno[i]) {
-			work_done = true;
-			ring->hangcheck_seqno = seqno[i];
+		if (ring->hangcheck_score > 2) {
+			rings_hung++;
+			DRM_ERROR("%s seems hung\n", ring->name);
 		}
 	}
 
-	if (!work_done) {
-		if (i915_hangcheck_hung(dev))
-			return;
-	} else {
-		dev_priv->gpu_error.hangcheck_count = 0;
-	}
+	if (rings_hung)
+		return i915_handle_error(dev, true);
 
-repeat:
-	/* Reset timer case chip hangs without another request being added */
-	mod_timer(&dev_priv->gpu_error.hangcheck_timer,
-		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
+	if (busy_count)
+		/* Reset timer case chip hangs without another request
+		 * being added */
+		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
+			  round_jiffies_up(jiffies +
+					   DRM_I915_HANGCHECK_JIFFIES));
 }
 
 /* drm_dma.h hooks
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 9599c56..97b8f37 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -138,6 +138,7 @@ struct  intel_ring_buffer {
 	struct drm_i915_gem_object *last_context_obj;
 
 	u32 hangcheck_seqno;
+	int hangcheck_score;
 
 	void *private;
 };
-- 
1.7.9.5

* [PATCH 07/13] drm/i915: remove i915_hangcheck_hung
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (5 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 08/13] drm/i915: add struct ctx_reset_state Mika Kuoppala
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

The rework of the per-ring hangcheck made this obsolete.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |    1 -
 drivers/gpu/drm/i915/i915_irq.c |   21 ---------------------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index fb51b4f..d225afd 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -796,7 +796,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
-	int hangcheck_count;
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4da8691..bfe5e84 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1785,27 +1785,6 @@ static bool i915_hangcheck_ring_hung(struct drm_device *dev,
 	return false;
 }
 
-static bool i915_hangcheck_hung(struct drm_device *dev)
-{
-	drm_i915_private_t *dev_priv = dev->dev_private;
-
-	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
-		bool hung = true;
-		struct intel_ring_buffer *ring;
-		int i;
-
-		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
-		i915_handle_error(dev, true);
-
-		for_each_ring(ring, dev_priv, i)
-			hung &= i915_hangcheck_ring_hung(dev, ring);
-
-		return hung;
-	}
-
-	return false;
-}
-
 /**
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. The first time this is called we simply record
-- 
1.7.9.5

* [PATCH 08/13] drm/i915: add struct ctx_reset_state
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (6 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 07/13] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 09/13] drm/i915: add reset_state for hw_contexts Mika Kuoppala
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

To count context losses, add struct ctx_reset_state to
both i915_hw_context and drm_i915_file_private. The one in
drm_i915_file_private is used when there is no context.

Use kzalloc when allocating drm_i915_file_private so that
the state starts out zeroed.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c |    2 +-
 drivers/gpu/drm/i915/i915_drv.h |    8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index e16099b..f9919a3 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1792,7 +1792,7 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv;
 
 	DRM_DEBUG_DRIVER("\n");
-	file_priv = kmalloc(sizeof(*file_priv), GFP_KERNEL);
+	file_priv = kzalloc(sizeof(*file_priv), GFP_KERNEL);
 	if (!file_priv)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d225afd..1c67fb2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -433,6 +433,11 @@ struct i915_hw_ppgtt {
 	void (*cleanup)(struct i915_hw_ppgtt *ppgtt);
 };
 
+struct ctx_reset_state {
+	u32 total;
+	u32 innocent;
+	u32 guilty;
+};
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
@@ -443,6 +448,7 @@ struct i915_hw_context {
 	struct drm_i915_file_private *file_priv;
 	struct intel_ring_buffer *ring;
 	struct drm_i915_gem_object *obj;
+	struct ctx_reset_state reset_state;
 };
 
 enum no_fbc_reason {
@@ -1258,6 +1264,8 @@ struct drm_i915_file_private {
 		struct list_head request_list;
 	} mm;
 	struct idr context_idr;
+
+	struct ctx_reset_state reset_state;
 };
 
 #define INTEL_INFO(dev)	(((struct drm_i915_private *) (dev)->dev_private)->info)
-- 
1.7.9.5

* [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (7 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 08/13] drm/i915: add struct ctx_reset_state Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-27  1:47   ` Ian Romanick
  2013-02-26 11:05 ` [PATCH 10/13] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

For arb-robustness, every context needs to have its own
reset state tracking. The default context will be handled in an
identical way to the no-context case further down in the patch set.
For the no-context case, the reset state will be stored in
the file_priv part.

v2: handle default context inside get_reset_state

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
 drivers/gpu/drm/i915/i915_gem_context.c |   34 +++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1c67fb2..2cc5817 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring,
 			struct drm_file *file, int to_id,
 			struct i915_hw_context **ctx);
 void i915_gem_context_free(struct kref *ctx_ref);
+int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
+				     struct drm_file *file,
+				     u32 id,
+				     struct ctx_reset_state **rs);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index cba54fb..1b14a06 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data)
 	return 0;
 }
 
+int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
+				     struct drm_file *file,
+				     u32 id,
+				     struct ctx_reset_state **rs)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct i915_hw_context *to;
+
+	if (dev_priv->hw_contexts_disabled)
+		return -ENOENT;
+
+	if (ring->id != RCS)
+		return -EINVAL;
+
+	if (rs == NULL)
+		return -EINVAL;
+
+	if (file == NULL)
+		return -EINVAL;
+
+	if (id == DEFAULT_CONTEXT_ID) {
+		*rs = &file_priv->reset_state;
+	} else {
+		to = i915_gem_context_get(file->driver_priv, id);
+		if (to == NULL)
+			return -ENOENT;
+
+		*rs = &to->reset_state;
+	}
+
+	return 0;
+}
+
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
-- 
1.7.9.5

* [PATCH 10/13] drm/i915: mark rings which were waiting when hang happened
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (8 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 09/13] drm/i915: add reset_state for hw_contexts Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 11/13] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

For the guilty batchbuffer analysis done later on ring resets,
mark all waiting rings so that we can skip them when trying
to find the true culprit of the gpu hang.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |   11 +++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index bfe5e84..6664e22 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1761,13 +1761,20 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	u32 tmp = I915_READ_CTL(ring);
+
+	ring->hangcheck_was_waiting = false;
+
 	if (tmp & RING_WAIT) {
 		DRM_ERROR("Kicking stuck wait on %s\n",
 			  ring->name);
 		I915_WRITE_CTL(ring, tmp);
-		return true;
+		ring->hangcheck_was_waiting = true;
 	}
-	return false;
+
+	if ((INTEL_INFO(dev)->gen >= 6) && (tmp & RING_WAIT_SEMAPHORE))
+		ring->hangcheck_was_waiting = true;
+
+	return ring->hangcheck_was_waiting;
 }
 
 static bool i915_hangcheck_ring_hung(struct drm_device *dev,
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 97b8f37..573b0ef 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -139,6 +139,7 @@ struct  intel_ring_buffer {
 
 	u32 hangcheck_seqno;
 	int hangcheck_score;
+	bool hangcheck_was_waiting;
 
 	void *private;
 };
-- 
1.7.9.5

* [PATCH 11/13] drm/i915: add batch object and context to i915_add_request()
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (9 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 10/13] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 14:22   ` Chris Wilson
  2013-02-26 11:05 ` [PATCH 12/13] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 13/13] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
  12 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

In order to track down the batch buffer and context which
caused the ring to hang, store references to the bo and context
in the request struct. A request can also cause the gpu to hang
after the batch, in the flush section of the ring. To detect this,
also store the ring offset of the start of the flush portion in the request.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |   11 +++++++++--
 drivers/gpu/drm/i915/i915_gem.c            |   12 ++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++---
 drivers/gpu/drm/i915/intel_overlay.c       |    5 +++--
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 2cc5817..9361b2e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1241,9 +1241,15 @@ struct drm_i915_gem_request {
 	/** GEM sequence number associated with this request. */
 	uint32_t seqno;
 
-	/** Postion in the ringbuffer of the end of the request */
+	/** Position in the ringbuffer of the start of the request */
+	u32 head;
+
+	/** Position in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Batch buffer related to this request if any */
+	struct drm_i915_gem_object *batch_obj;
+
 	/** Context related to this request */
 	struct i915_hw_context *ctx;
 
@@ -1639,7 +1645,8 @@ int __must_check i915_gem_idle(struct drm_device *dev);
 int i915_add_request(struct intel_ring_buffer *ring,
 		     struct drm_file *file,
 		     u32 *seqno,
-		     struct i915_hw_context *ctx);
+		     struct i915_hw_context *ctx,
+		     struct drm_i915_gem_object *batch_obj);
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 850b1bb..ace482d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -967,7 +967,7 @@ i915_gem_check_olr(struct intel_ring_buffer *ring, u32 seqno)
 
 	ret = 0;
 	if (seqno == ring->outstanding_lazy_request)
-		ret = i915_add_request(ring, NULL, NULL, NULL);
+		ret = i915_add_request(ring, NULL, NULL, NULL, NULL);
 
 	return ret;
 }
@@ -1998,14 +1998,16 @@ int
 i915_add_request(struct intel_ring_buffer *ring,
 		 struct drm_file *file,
 		 u32 *out_seqno,
-		 struct i915_hw_context *ctx)
+		 struct i915_hw_context *ctx,
+		 struct drm_i915_gem_object *obj)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
-	u32 request_ring_position;
+	u32 request_ring_position, request_start;
 	int was_empty;
 	int ret;
 
+	request_start = intel_ring_get_tail(ring);
 	/*
 	 * Emit any outstanding flushes - execbuf can fail to emit the flush
 	 * after having emitted the batchbuffer command. Hence we need to fix
@@ -2037,7 +2039,9 @@ i915_add_request(struct intel_ring_buffer *ring,
 
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
+	request->head = request_start;
 	request->tail = request_ring_position;
+	request->batch_obj = obj;
 	request->ctx = ctx;
 
 	if (request->ctx)
@@ -2276,7 +2280,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
 		if (ring->gpu_caches_dirty)
-			i915_add_request(ring, NULL, NULL, NULL);
+			i915_add_request(ring, NULL, NULL, NULL, NULL);
 
 		idle &= list_empty(&ring->request_list);
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index b1e54d5..710784d 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -790,13 +790,14 @@ static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
 				    struct intel_ring_buffer *ring,
-				    struct i915_hw_context *ctx)
+				    struct i915_hw_context *ctx,
+				    struct drm_i915_gem_object *obj)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_add_request(ring, file, NULL, ctx);
+	(void)i915_add_request(ring, file, NULL, ctx, obj);
 }
 
 static int
@@ -1070,7 +1071,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx, batch_obj);
 
 err:
 	eb_destroy(eb);
diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
index 540be47..125b6fc 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -217,7 +217,7 @@ static int intel_overlay_do_wait_request(struct intel_overlay *overlay,
 	int ret;
 
 	BUG_ON(overlay->last_flip_req);
-	ret = i915_add_request(ring, NULL, &overlay->last_flip_req, NULL);
+	ret = i915_add_request(ring, NULL, &overlay->last_flip_req, NULL, NULL);
 	if (ret)
 		return ret;
 
@@ -286,7 +286,8 @@ static int intel_overlay_continue(struct intel_overlay *overlay,
 	intel_ring_emit(ring, flip_addr);
 	intel_ring_advance(ring);
 
-	return i915_add_request(ring, NULL, &overlay->last_flip_req, NULL);
+	return i915_add_request(ring, NULL, &overlay->last_flip_req,
+				NULL, NULL);
 }
 
 static void intel_overlay_release_old_vid_tail(struct intel_overlay *overlay)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 559d3e5..355291a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1423,7 +1423,7 @@ int intel_ring_idle(struct intel_ring_buffer *ring)
 
 	/* We need to add any requests required to flush the objects and ring */
 	if (ring->outstanding_lazy_request) {
-		ret = i915_add_request(ring, NULL, NULL, NULL);
+		ret = i915_add_request(ring, NULL, NULL, NULL, NULL);
 		if (ret)
 			return ret;
 	}
-- 
1.7.9.5

* [PATCH 12/13] drm/i915: find guilty batch buffer on ring resets
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (10 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 11/13] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  2013-02-26 11:05 ` [PATCH 13/13] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
  12 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

After the hangcheck timer has declared the gpu to be hung,
the rings are reset. During ring reset, when clearing the
request list, do a post-mortem analysis to find out
the guilty batch buffer.

Select requests for further analysis by inspecting
the completed sequence number, which has been updated
in the HWS page. If a request was completed, it can't
be related to the hang.

For non-completed requests, mark the batch as guilty
if the ring was not waiting and the ring head was
stuck inside the buffer object or in the flush region
right after the batch. Mark everything else
as innocent.

v2: Fixed a typo in commit message (Ville Syrjälä)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c |   91 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ace482d..6380a50 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2099,9 +2099,97 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
 	spin_unlock(&file_priv->mm.lock);
 }
 
+static bool i915_head_inside_object(u32 acthd, struct drm_i915_gem_object *obj)
+{
+	if (acthd >= obj->gtt_offset &&
+	    acthd < obj->gtt_offset + obj->base.size)
+		return true;
+
+	return false;
+}
+
+static bool i915_head_inside_request(u32 acthd, u32 rs, u32 re)
+{
+	if (rs < re) {
+		if (acthd >= rs && acthd < re)
+			return true;
+	} else if (rs > re) {
+		if (acthd >= rs || acthd < re)
+			return true;
+	}
+
+	return false;
+}
+
+static bool i915_request_guilty(struct drm_i915_gem_request *request,
+				const u32 acthd, bool *inside)
+{
+	if (request->batch_obj) {
+		if (i915_head_inside_object(acthd, request->batch_obj)) {
+			*inside = true;
+			return true;
+		}
+	}
+
+	if (i915_head_inside_request(acthd, request->head, request->tail)) {
+		*inside = false;
+		return true;
+	}
+
+	return false;
+}
+
+static void i915_set_reset_status(struct intel_ring_buffer *ring,
+				  struct drm_i915_gem_request *request,
+				  u32 acthd)
+{
+	bool inside;
+	struct ctx_reset_state *rs = NULL;
+	bool guilty;
+
+	/* Innocent until proven guilty */
+	guilty = false;
+
+	if (!ring->hangcheck_was_waiting &&
+	    i915_request_guilty(request, acthd, &inside)) {
+		DRM_ERROR("%s hung %s bo (0x%x ctx %d) at 0x%x\n",
+			  ring->name,
+			  inside ? "inside" : "flushing",
+			  request->batch_obj ?
+			  request->batch_obj->gtt_offset : 0,
+			  request->ctx ? request->ctx->id : 0,
+			  acthd);
+
+		guilty = true;
+	}
+
+	/* If contexts are disabled or this is the default context, use
+	 * file_priv->reset_state
+	 */
+	if (request->ctx && request->ctx->id != DEFAULT_CONTEXT_ID)
+		rs = &request->ctx->reset_state;
+	else if (request->file_priv)
+		rs = &request->file_priv->reset_state;
+
+	if (rs) {
+		rs->total++;
+
+		if (guilty)
+			rs->guilty++;
+		else
+			rs->innocent++;
+	}
+}
+
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
+	u32 completed_seqno;
+	u32 acthd;
+
+	acthd = intel_ring_get_active_head(ring);
+	completed_seqno = ring->get_seqno(ring, false);
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2109,6 +2197,9 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   struct drm_i915_gem_request,
 					   list);
 
+		if (request->seqno > completed_seqno)
+			i915_set_reset_status(ring, request, acthd);
+
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
 
-- 
1.7.9.5

* [PATCH 13/13] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
                   ` (11 preceding siblings ...)
  2013-02-26 11:05 ` [PATCH 12/13] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
@ 2013-02-26 11:05 ` Mika Kuoppala
  12 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 11:05 UTC (permalink / raw)
  To: intel-gfx

If a context has recently submitted a faulty batchbuffer guilty of
a gpu hang and decides to keep submitting more crap, ban it permanently.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |   23 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h            |    5 +++++
 drivers/gpu/drm/i915/i915_gem.c            |    8 ++++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   12 ++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index b342749..e305fbe 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -815,6 +815,8 @@ int intel_gpu_reset(struct drm_device *dev)
 int i915_reset(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
+	struct ctx_reset_state *gstate;
+	bool do_wedge = true;
 	int ret;
 
 	if (!i915_try_reset)
@@ -822,10 +824,29 @@ int i915_reset(struct drm_device *dev)
 
 	mutex_lock(&dev->struct_mutex);
 
+	/* i915_gem_reset will set this if it finds guilty context */
+	dev_priv->gpu_error.guilty_state = NULL;
+
 	i915_gem_reset(dev);
 
+	gstate = dev_priv->gpu_error.guilty_state;
+
+	if (gstate) {
+		if (gstate->guilty == 1) {
+			do_wedge = false;
+		} else if (!gstate->banned &&
+			   get_seconds() - gstate->last_guilty_reset < 5) {
+			gstate->banned = true;
+			do_wedge = false;
+		}
+
+		gstate->last_guilty_reset = get_seconds();
+	}
+
+	dev_priv->gpu_error.guilty_state = NULL;
+
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
+	if (do_wedge && get_seconds() - dev_priv->gpu_error.last_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
 	else
 		ret = intel_gpu_reset(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9361b2e..06c518f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -437,6 +437,10 @@ struct ctx_reset_state {
 	u32 total;
 	u32 innocent;
 	u32 guilty;
+	unsigned long last_guilty_reset;
+
+	/* banned to submit more work */
+	bool banned;
 };
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
@@ -810,6 +814,7 @@ struct i915_gpu_error {
 	struct work_struct work;
 
 	unsigned long last_reset;
+	struct ctx_reset_state *guilty_state;
 
 	/**
 	 * State variable and reset counter controlling the reset flow
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6380a50..1d41e97 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2143,6 +2143,7 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 				  struct drm_i915_gem_request *request,
 				  u32 acthd)
 {
+	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	bool inside;
 	struct ctx_reset_state *rs = NULL;
 	bool guilty;
@@ -2174,10 +2175,13 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 	if (rs) {
 		rs->total++;
 
-		if (guilty)
+		if (guilty) {
 			rs->guilty++;
-		else
+
+			dev_priv->gpu_error.guilty_state = rs;
+		} else {
 			rs->innocent++;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 710784d..20a4011 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -837,6 +837,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
 	struct i915_hw_context *ctx;
+	struct ctx_reset_state *rs;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1020,6 +1021,17 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+
+	ret = i915_gem_context_get_reset_state(&dev_priv->ring[RCS],
+					       file, ctx_id, &rs);
+	if (ret)
+		goto err;
+
+	if (rs->banned) {
+		ret = -EIO;
+		goto err;
+	}
+
 	ret = i915_switch_context(ring, file, ctx_id, &ctx);
 	if (ret)
 		goto err;
-- 
1.7.9.5

* Re: [PATCH 04/13] drm/i915: track ring progression using seqnos
  2013-02-26 11:05 ` [PATCH 04/13] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-02-26 14:12   ` Chris Wilson
  2013-02-26 15:09     ` Mika Kuoppala
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Wilson @ 2013-02-26 14:12 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:07PM +0200, Mika Kuoppala wrote:
> Instead of relying on acthd, track ring seqno progression
> to detect whether a ring has hung.

This needs a comment that it has a user visible side-effect of limiting
batches to a maximum of 1.5s runtime. Before, that limit was softer in
that we had a chance to spot that the GPU was busy - even if we could be
fooled by an infinite loop.
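
e.g. something along the lines of (exact wording and placement up to
you):

	/* Note: seqno based hangcheck treats a batch that makes no seqno
	 * progress for a few hangcheck periods as hung, which bounds the
	 * runtime of any single batch. */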

Did you write an i-g-t for detecting a ring of chained batchbuffers?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score
  2013-02-26 11:05 ` [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
@ 2013-02-26 14:16   ` Chris Wilson
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Wilson @ 2013-02-26 14:16 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:09PM +0200, Mika Kuoppala wrote:
> Add per ring score of possible culprit for gpu hang. If
> ring is busy and not waiting, it will get the highest score
> across calls to i915_hangcheck_elapsed. This way we are
> most likely to find the ring that caused the hang among
> the waiting ones.

I think you want to incorporate
https://patchwork.kernel.org/patch/1958381/
into this series as it addresses one case of many waiters, none guilty.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/13] drm/i915: add batch object and context to i915_add_request()
  2013-02-26 11:05 ` [PATCH 11/13] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
@ 2013-02-26 14:22   ` Chris Wilson
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Wilson @ 2013-02-26 14:22 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:14PM +0200, Mika Kuoppala wrote:
> In order to track down a batch buffer and context which
> caused the ring to hang, store a reference to the bo and context
> in the request struct. A request can also cause the gpu to hang
> after the batch, in the flush section of the ring. To detect this,
> store the offset of the start of the flush portion in the request.

The call to i915_add_request() is becoming grotesque. We have two classes
of callers: execbuffer, which sets everything, and everyone else, who sets
nothing. So please provide a macro to curry all the extra unused
parameters and minimise the number of places we need to update
every time the interface changes.
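
Something along these lines, say (illustrative only -- assuming the
extended function grows the batch object, context and flush offset as
extra arguments; names are not from the series as posted):

/* illustrative wrapper; __i915_add_request() stands in for the fully
 * parameterised version, so plain callers pass no batch/context/offset */
#define i915_add_request(ring, seqno) \
	__i915_add_request(ring, NULL, NULL, 0, seqno)

so that only execbuffer has to spell out the full argument list.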
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 04/13] drm/i915: track ring progression using seqnos
  2013-02-26 14:12   ` Chris Wilson
@ 2013-02-26 15:09     ` Mika Kuoppala
  2013-02-26 15:31       ` Chris Wilson
  0 siblings, 1 reply; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-26 15:09 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Feb 26, 2013 at 01:05:07PM +0200, Mika Kuoppala wrote:
>> Instead of relying on acthd, track ring seqno progression
>> to detect if the ring has hung.
>
> This needs a comment that it has a user visible side-effect of limiting
> batches to a maximum of 1.5s runtime. Before, that limit was softer in
> that we had a chance to spot that the GPU was busy - even if we could be
> fooled by an infinite loop.

As the current code has:

> if (dev_priv->gpu_error.hangcheck_count++ > 1)

to trigger the hang, so it will trigger on the 3rd timer call (4.5 seconds).

With my patch it will be 4.5 seconds if the rings are waiting
and 3 seconds if a non-waiting ring is involved.

As I can't explain the added benefit of declaring a hang earlier when
there is a non-waiting ring, do you want me to simplify this to just
declare a hang if there is no progress in 4.5 seconds in both cases,
to match the old trigger timing?

> Did you write an i-g-t for detecting a ring of chained batchbuffers?
> -Chris

Yes, you can find it in here:

https://github.com/mkuoppal/intel-gpu-tools/commit/1df7d49ff9ecedf9c55933a9e36b1eb41f07abc6

If you want to compile/run, you also need:
https://github.com/mkuoppal/linux/commit/f24c1d64f89b070c74afa38ab5ac148f56c84aaf

The interface will change, but currently the test is based on the old ioctl.

Thanks,
-Mika

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 04/13] drm/i915: track ring progression using seqnos
  2013-02-26 15:09     ` Mika Kuoppala
@ 2013-02-26 15:31       ` Chris Wilson
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Wilson @ 2013-02-26 15:31 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 05:09:03PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Tue, Feb 26, 2013 at 01:05:07PM +0200, Mika Kuoppala wrote:
> >> Instead of relying on acthd, track ring seqno progression
> >> to detect if the ring has hung.
> >
> > This needs a comment that it has a user visible side-effect of limiting
> > batches to a maximum of 1.5s runtime. Before, that limit was softer in
> > that we had a chance to spot that the GPU was busy - even if we could be
> > fooled by an infinite loop.
> 
> As the current code has:
> 
> > if (dev_priv->gpu_error.hangcheck_count++ > 1)
> 
> to trigger the hang, so it will trigger on the 3rd timer call (4.5 seconds).
> 
> With my patch it will be 4.5 seconds if the rings are waiting
> and 3 seconds if a non-waiting ring is involved.
> 
> As I can't explain the added benefit of declaring a hang earlier when
> there is a non-waiting ring, do you want me to simplify this to just
> declare a hang if there is no progress in 4.5 seconds in both cases,
> to match the old trigger timing?

Yes, that would be good if you can simplify the logic so that it does
not react differently for no apparent reason. I would prefer the 3s
timeout.

> > Did you write an i-g-t for detecting a ring of chained batchbuffers?
> > -Chris
> 
> Yes, you can find it in here:
> 
> https://github.com/mkuoppal/intel-gpu-tools/commit/1df7d49ff9ecedf9c55933a9e36b1eb41f07abc6

I'd definitely add an exercise for an infinite loop that was not caught by
the current hangcheck, and add a fixes: note to this changelog.
(Couldn't spot that explicit case in the reset-status test, and it can
be added independently of this series as a demonstration of a known
breakage.)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/13] drm/i915: add context parameter to i915_switch_context()
  2013-02-26 11:05 ` [PATCH 01/13] drm/i915: add context parameter to i915_switch_context() Mika Kuoppala
@ 2013-02-26 18:31   ` Ben Widawsky
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Widawsky @ 2013-02-26 18:31 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:04PM +0200, Mika Kuoppala wrote:
> In preparation for the next commit, return context that
> was switch to from i915_switch_context().
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

I'd suggest using ERR_PTR and PTR_ERR instead of the **ctx

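Roughly (untested sketch), i915_switch_context() would then return the
context pointer directly and the callers become:

	ctx = i915_switch_context(ring, file, ctx_id);
	if (IS_ERR(ctx))
		return PTR_ERR(ctx);

with the error paths inside returning e.g. ERR_PTR(-ENOENT) instead of
filling in an out parameter.
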
> ---
>  drivers/gpu/drm/i915/i915_drv.h            |    3 ++-
>  drivers/gpu/drm/i915/i915_gem.c            |    2 +-
>  drivers/gpu/drm/i915/i915_gem_context.c    |   36 ++++++++++++++++++----------
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |    2 +-
>  4 files changed, 27 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e95337c..27459ec 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1671,7 +1671,8 @@ void i915_gem_context_init(struct drm_device *dev);
>  void i915_gem_context_fini(struct drm_device *dev);
>  void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
>  int i915_switch_context(struct intel_ring_buffer *ring,
> -			struct drm_file *file, int to_id);
> +			struct drm_file *file, int to_id,
> +			struct i915_hw_context **ctx);
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>  				  struct drm_file *file);
>  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 8413ffc..61610f3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2517,7 +2517,7 @@ int i915_gpu_idle(struct drm_device *dev)
>  
>  	/* Flush everything onto the inactive list. */
>  	for_each_ring(ring, dev_priv, i) {
> -		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
> +		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID, NULL);
>  		if (ret)
>  			return ret;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 21177d9..4125919 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -428,10 +428,9 @@ static int do_switch(struct i915_hw_context *to)
>  /**
>   * i915_switch_context() - perform a GPU context switch.
>   * @ring: ring for which we'll execute the context switch
> - * @file_priv: file_priv associated with the context, may be NULL
> - * @id: context id number
> - * @seqno: sequence number by which the new context will be switched to
> - * @flags:
> + * @file: file associated with the context, may be NULL
> + * @to_id: context id number
> + * @ctx: i915_hw_context we switched to, may be NULL
>   *
>   * The context life cycle is simple. The context refcount is incremented and
>   * decremented by 1 and create and destroy. If the context is in use by the GPU,
> @@ -440,29 +439,40 @@ static int do_switch(struct i915_hw_context *to)
>   */
>  int i915_switch_context(struct intel_ring_buffer *ring,
>  			struct drm_file *file,
> -			int to_id)
> +			int to_id,
> +			struct i915_hw_context **ctx)
>  {
>  	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> -	struct i915_hw_context *to;
> +	struct i915_hw_context *to = NULL;
> +	int ret = 0;
>  
>  	if (dev_priv->hw_contexts_disabled)
> -		return 0;
> +		goto out;
>  
>  	if (ring != &dev_priv->ring[RCS])
> -		return 0;
> +		goto out;
>  
>  	if (to_id == DEFAULT_CONTEXT_ID) {
>  		to = ring->default_context;
>  	} else {
> -		if (file == NULL)
> -			return -EINVAL;
> +		if (file == NULL) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
>  
>  		to = i915_gem_context_get(file->driver_priv, to_id);
> -		if (to == NULL)
> -			return -ENOENT;
> +		if (to == NULL) {
> +			ret = -ENOENT;
> +			goto out;
> +		}
>  	}
>  
> -	return do_switch(to);
> +	ret = do_switch(to);
> +out:
> +	if (ctx)
> +		*ctx = to;
> +
> +	return ret;
>  }
>  
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 2f2daeb..499d025 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1017,7 +1017,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  	if (ret)
>  		goto err;
>  
> -	ret = i915_switch_context(ring, file, ctx_id);
> +	ret = i915_switch_context(ring, file, ctx_id, NULL);
>  	if (ret)
>  		goto err;
>  
> -- 
> 1.7.9.5
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/13] drm/i915: reference count for i915_hw_contexts
  2013-02-26 11:05 ` [PATCH 02/13] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-02-26 21:45   ` Ben Widawsky
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Widawsky @ 2013-02-26 21:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:05PM +0200, Mika Kuoppala wrote:
> In preparation for analysing which context was
> guilty of a gpu hang, store a kreffed context pointer
> in the request struct.
> 
> This allows us to inspect contexts when the gpu is reset,
> even if those contexts have already been released
> by userspace.
> 
> v2: track i915_hw_context pointers instead of using ctx_ids
>     (from Chris Wilson)
> 

Daniel: it's like if we just stored a bit of context info in the object,
we wouldn't need to refcount... hmm

[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle
  2013-02-26 11:05 ` [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
@ 2013-02-26 22:41   ` Ben Widawsky
  2013-03-01 13:49     ` Mika Kuoppala
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Widawsky @ 2013-02-26 22:41 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:06PM +0200, Mika Kuoppala wrote:
> In preparation for next commit, pass seqno as a parameter
> to i915_hangcheck_ring_idle as it will be used inside
> i915_hangcheck_elapsed.
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |   11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 29037e0..4f60c87 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1740,11 +1740,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
>  			  struct drm_i915_gem_request, list)->seqno;
>  }
>  
> -static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
> +static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
> +				     u32 ring_seqno, bool *err)
>  {
>  	if (list_empty(&ring->request_list) ||
> -	    i915_seqno_passed(ring->get_seqno(ring, false),
> -			      ring_last_seqno(ring))) {
> +	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
>  		/* Issue a wake-up to catch stuck h/w. */
>  		if (waitqueue_active(&ring->irq_queue)) {
>  			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",

I think just return err. Don't bother with a bool and *err return.

> @@ -1821,7 +1821,10 @@ void i915_hangcheck_elapsed(unsigned long data)
>  	memset(acthd, 0, sizeof(acthd));
>  	idle = true;
>  	for_each_ring(ring, dev_priv, i) {
> -	    idle &= i915_hangcheck_ring_idle(ring, &err);
> +		u32 seqno;
> +
> +		seqno = ring->get_seqno(ring, false);
> +		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
>  	    acthd[i] = intel_ring_get_active_head(ring);
>  	}
>  
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung
  2013-02-26 11:05 ` [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
@ 2013-02-26 23:03   ` Ben Widawsky
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Widawsky @ 2013-02-26 23:03 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 01:05:08PM +0200, Mika Kuoppala wrote:
> In preparation to track per ring progress in hangcheck,
> add i915_hangcheck_ring_hung.
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |   31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index de5af12..b828807 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1771,28 +1771,35 @@ static bool kick_ring(struct intel_ring_buffer *ring)
>  	return false;
>  }
>  
> +static bool i915_hangcheck_ring_hung(struct drm_device *dev,
> +				     struct intel_ring_buffer *ring)
> +{
> +	if (!IS_GEN2(dev)) {
> +		/* Is the chip hanging on a WAIT_FOR_EVENT?
> +		 * If so we can simply poke the RB_WAIT bit
> +		 * and break the hang. This should work on
> +		 * all but the second generation chipsets.
> +		 */
> +		return !kick_ring(ring);
> +	}
> +
> +	return false;
> +}
> +
>  static bool i915_hangcheck_hung(struct drm_device *dev)
>  {
>  	drm_i915_private_t *dev_priv = dev->dev_private;
>  
>  	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
>  		bool hung = true;
> +		struct intel_ring_buffer *ring;
> +		int i;
>  
>  		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
>  		i915_handle_error(dev, true);
>  
> -		if (!IS_GEN2(dev)) {
> -			struct intel_ring_buffer *ring;
> -			int i;
> -
> -			/* Is the chip hanging on a WAIT_FOR_EVENT?
> -			 * If so we can simply poke the RB_WAIT bit
> -			 * and break the hang. This should work on
> -			 * all but the second generation chipsets.
> -			 */
> -			for_each_ring(ring, dev_priv, i)
> -				hung &= !kick_ring(ring);
> -		}
> +		for_each_ring(ring, dev_priv, i)
> +			hung &= i915_hangcheck_ring_hung(dev, ring);
>  
>  		return hung;
>  	}

I don't think you need dev. Just bail early if it's GEN2. If you really
need dev later, you can get it from ring->dev.

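i.e. something like this (untested sketch, keeping the return values of
the patch as posted):

static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
{
	/* kick_ring() cannot unstick gen2, so bail out early */
	if (IS_GEN2(ring->dev))
		return false;

	/* Is the chip hanging on a WAIT_FOR_EVENT?
	 * If so we can simply poke the RB_WAIT bit
	 * and break the hang.
	 */
	return !kick_ring(ring);
}
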
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-26 11:05 ` [PATCH 09/13] drm/i915: add reset_state for hw_contexts Mika Kuoppala
@ 2013-02-27  1:47   ` Ian Romanick
  2013-02-27  1:50     ` Ian Romanick
                       ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ian Romanick @ 2013-02-27  1:47 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
> For arb-robustness, every context needs to have it's own
> reset state tracking. Default context will be handled in a identical
> way as the no-context case in further down in the patch set.
> For no-context case, the reset state will be stored in
> the file_priv part.
>
> v2: handle default context inside get_reset_state

This isn't the interface we want.  I already sent you the patches for 
Mesa, and you seem to have completely ignored them.  Moreover, this 
interface provides no mechanism to query for its existence (other than 
relying on the kernel version), and no method to deprecate it.

Based on e-mail discussions, I think danvet agrees with me here.

Putting guilty / innocent counting in kernel puts policy decisions in 
the kernel that belong with the user space API that implements them. 
Putting these choices in the kernel significantly decreases how "future 
proof" the interface is.  Since any kernel/user interface has to be kept 
forever, this is just asking for maintenance trouble down the road.

Also, it's difficult (for me) to tell from the rest of the series 
whether or not a context that was not affected by a reset (had no 
batches in flight at all) can still observe that a reset occurred.

 From the GL point of view, once a context has observed a reset, it's 
garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the same. 
  Since it's a count, GL will have to query the values before any 
rendering happens or it won't know whether the value 6 means there was a 
reset or not.

The right interface is a set of flags indicating the state the context 
was in when it observed a reset.  As this patch series stands, Mesa will 
not use this interface.

> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
>   drivers/gpu/drm/i915/i915_gem_context.c |   34 +++++++++++++++++++++++++++++++
>   2 files changed, 38 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 1c67fb2..2cc5817 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring,
>   			struct drm_file *file, int to_id,
>   			struct i915_hw_context **ctx);
>   void i915_gem_context_free(struct kref *ctx_ref);
> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
> +				     struct drm_file *file,
> +				     u32 id,
> +				     struct ctx_reset_state **rs);
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file);
>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index cba54fb..1b14a06 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data)
>   	return 0;
>   }
>
> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
> +				     struct drm_file *file,
> +				     u32 id,
> +				     struct ctx_reset_state **rs)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct drm_i915_file_private *file_priv = file->driver_priv;
> +	struct i915_hw_context *to;
> +
> +	if (dev_priv->hw_contexts_disabled)
> +		return -ENOENT;
> +
> +	if (ring->id != RCS)
> +		return -EINVAL;
> +
> +	if (rs == NULL)
> +		return -EINVAL;
> +
> +	if (file == NULL)
> +		return -EINVAL;
> +
> +	if (id == DEFAULT_CONTEXT_ID) {
> +		*rs = &file_priv->reset_state;
> +	} else {
> +		to = i915_gem_context_get(file->driver_priv, id);
> +		if (to == NULL)
> +			return -ENOENT;
> +
> +		*rs = &to->reset_state;
> +	}
> +
> +	return 0;
> +}
> +
>   void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
>   {
>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-27  1:47   ` Ian Romanick
@ 2013-02-27  1:50     ` Ian Romanick
  2013-02-27  9:13     ` Chris Wilson
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Ian Romanick @ 2013-02-27  1:50 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On 02/26/2013 05:47 PM, Ian Romanick wrote:
> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>> For arb-robustness, every context needs to have its own
>> reset state tracking. The default context will be handled in the same
>> way as the no-context case further down in the patch set.
>> For the no-context case, the reset state will be stored in
>> the file_priv part.
>>
>> v2: handle default context inside get_reset_state
>
> This isn't the interface we want.  I already sent you the patches for
> Mesa, and you seem to have completely ignored them.  Moreover, this
> interface provides no mechanism to query for its existence (other than
> relying on the kernel version), and no method to deprecate it.
>
> Based on e-mail discussions, I think danvet agrees with me here.
>
> Putting guilty / innocent counting in kernel puts policy decisions in
> the kernel that belong with the user space API that implements them.
> Putting these choices in the kernel significantly decreases how "future
> proof" the interface is.  Since any kernel/user interface has to be kept
> forever, this is just asking for maintenance trouble down the road.
>
> Also, it's difficult (for me) to tell from the rest of the series
> whether or not a context that was not affected by a reset (had no
> batches in flight at all) can still observe that a reset occurred.
>
>  From the GL point of view, once a context has observed a reset, it's
> garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the same.
>   Since it's a count, GL will have to query the values before any
> rendering happens or it won't know whether the value 6 means there was a
> reset or not.
>
> The right interface is a set of flags indicating the state the context
> was in when it observed a reset.  As this patch series stands, Mesa will
> not use this interface.

I almost forgot... it seems like there is a lot of code in this series 
to support GPUs where we don't support (either in the existing kernel 
driver or in the hardware at all) hardware contexts.  Is this really 
necessary?  Is anyone going to implement ARB_robustness for 845?  I have 
no intention to do so.  It seems much better to simplify this code to 
only support the GPUs we are actually going to support in user-mode.

If there are no users, the code is untested.  Repeat Keith's mantra: 
Any code that isn't tested is broken. :)

>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
>>   drivers/gpu/drm/i915/i915_gem_context.c |   34
>> +++++++++++++++++++++++++++++++
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 1c67fb2..2cc5817 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1684,6 +1684,10 @@ int i915_switch_context(struct
>> intel_ring_buffer *ring,
>>               struct drm_file *file, int to_id,
>>               struct i915_hw_context **ctx);
>>   void i915_gem_context_free(struct kref *ctx_ref);
>> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
>> +                     struct drm_file *file,
>> +                     u32 id,
>> +                     struct ctx_reset_state **rs);
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>                     struct drm_file *file);
>>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index cba54fb..1b14a06 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p,
>> void *data)
>>       return 0;
>>   }
>>
>> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
>> +                     struct drm_file *file,
>> +                     u32 id,
>> +                     struct ctx_reset_state **rs)
>> +{
>> +    struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> +    struct drm_i915_file_private *file_priv = file->driver_priv;
>> +    struct i915_hw_context *to;
>> +
>> +    if (dev_priv->hw_contexts_disabled)
>> +        return -ENOENT;
>> +
>> +    if (ring->id != RCS)
>> +        return -EINVAL;
>> +
>> +    if (rs == NULL)
>> +        return -EINVAL;
>> +
>> +    if (file == NULL)
>> +        return -EINVAL;
>> +
>> +    if (id == DEFAULT_CONTEXT_ID) {
>> +        *rs = &file_priv->reset_state;
>> +    } else {
>> +        to = i915_gem_context_get(file->driver_priv, id);
>> +        if (to == NULL)
>> +            return -ENOENT;
>> +
>> +        *rs = &to->reset_state;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   void i915_gem_context_close(struct drm_device *dev, struct drm_file
>> *file)
>>   {
>>       struct drm_i915_file_private *file_priv = file->driver_priv;
>>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-27  1:47   ` Ian Romanick
  2013-02-27  1:50     ` Ian Romanick
@ 2013-02-27  9:13     ` Chris Wilson
  2013-02-28  1:15       ` Ian Romanick
  2013-02-27 10:11     ` Mika Kuoppala
  2013-02-27 17:20     ` Jesse Barnes
  3 siblings, 1 reply; 32+ messages in thread
From: Chris Wilson @ 2013-02-27  9:13 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
> >For arb-robustness, every context needs to have its own
> >reset state tracking. The default context will be handled in the same
> >way as the no-context case further down in the patch set.
> >For the no-context case, the reset state will be stored in
> >the file_priv part.
> >
> >v2: handle default context inside get_reset_state
> 
> This isn't the interface we want.  I already sent you the patches
> for Mesa, and you seem to have completely ignored them.  Moreover,
> this interface provides no mechanism to query for its existence
> (other than relying on the kernel version), and no method to
> deprecate it.

Its existence, and the existence of any successor, is the only test you
need to check for the interface. Testing the flag field up front also
lets you know if individual flags are known prior to use.
 
> Based on e-mail discussions, I think danvet agrees with me here.
> 
> Putting guilty / innocent counting in kernel puts policy decisions
> in the kernel that belong with the user space API that implements
> them. Putting these choices in the kernel significantly decreases
> how "future proof" the interface is.  Since any kernel/user
> interface has to be kept forever, this is just asking for
> maintenance trouble down the road.

I think you have the policy standpoint reversed.
 
> Also, it's difficult (for me) to tell from the rest of the series
> whether or not a context that was not affected by a reset (had no
> batches in flight at all) can still observe that a reset occurred.

Yes, the context can observe that the global reset count increased while
its own did not.

> From the GL point of view, once a context has observed a reset, it's
> garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the
> same.  Since it's a count, GL will have to query the values before
> any rendering happens or it won't know whether the value 6 means
> there was a reset or not.
> 
> The right interface is a set of flags indicating the state the
> context was in when it observed a reset.  As this patch series
> stands, Mesa will not use this interface.

This interface conforms to the specification and interface that you
last posted to the list and that were revised on-list. There are two
substantive differences; there is *less* policy in the kernel than
before, and the breadcrumb of last reset state was overlooked.

In order to make the reset status work in a multi-threaded environment
the kernel exists in, we have to assume that there will be multiple
actors on the reset status, and they will be comparing the state that
they are interested in. The set of flags that GL understands is
deducible from this interface.
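
E.g. (illustrative only, assuming the per-context counts end up exposed
to userspace more or less as-is, with 'base' being a snapshot the GL
context took at creation time):

/* illustrative only: map per-context counts onto the GL reset enums */
static GLenum deduce_reset_status(const struct ctx_reset_state *rs,
				  const struct ctx_reset_state *base)
{
	if (rs->guilty != base->guilty)
		return GL_GUILTY_CONTEXT_RESET_ARB;
	if (rs->innocent != base->innocent)
		return GL_INNOCENT_CONTEXT_RESET_ARB;
	return GL_NO_ERROR;
}
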
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-27  1:47   ` Ian Romanick
  2013-02-27  1:50     ` Ian Romanick
  2013-02-27  9:13     ` Chris Wilson
@ 2013-02-27 10:11     ` Mika Kuoppala
  2013-02-27 17:20     ` Jesse Barnes
  3 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-02-27 10:11 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

Ian Romanick <idr@freedesktop.org> writes:

> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>> For arb-robustness, every context needs to have its own
>> reset state tracking. The default context will be handled in the same
>> way as the no-context case further down in the patch set.
>> For the no-context case, the reset state will be stored in
>> the file_priv part.
>>
>> v2: handle default context inside get_reset_state
>
> This isn't the interface we want.  I already sent you the patches for 
> Mesa, and you seem to have completely ignored them.  Moreover, this 
> interface provides no mechanism to query for its existence (other than 
> relying on the kernel version), and no method to deprecate it.
>
> Based on e-mail discussions, I think danvet agrees with me here.
>
> Putting guilty / innocent counting in kernel puts policy decisions in 
> the kernel that belong with the user space API that implements them. 
> Putting these choices in the kernel significantly decreases how "future 
> proof" the interface is.  Since any kernel/user interface has to be kept 
> forever, this is just asking for maintenance trouble down the road.
>
> Also, it's difficult (for me) to tell from the rest of the series 
> whether or not a context that was not affected by a reset (had no 
> batches in flight at all) can still observe that a reset occurred.
>
>  From the GL point of view, once a context has observed a reset, it's 
> garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the same. 
>   Since it's a count, GL will have to query the values before any 
> rendering happens or it won't know whether the value 6 means there was a 
> reset or not.
>
> The right interface is a set of flags indicating the state the context 
> was in when it observed a reset.  As this patch series stands, Mesa will 
> not use this interface.

There is a bit of a misunderstanding here, I think.
I haven't ignored your input. I abandoned the statistics-based
ioctl interface because of that input.

The commit in question only adds an internal interface for querying
the context reset state.

My intention was to redo the ioctl interface on top of this internal
bookkeeping. The ioctl interface will be made according to this
commit:

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97

Thanks,
-Mika

>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
>>   drivers/gpu/drm/i915/i915_gem_context.c |   34 +++++++++++++++++++++++++++++++
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 1c67fb2..2cc5817 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1684,6 +1684,10 @@ int i915_switch_context(struct intel_ring_buffer *ring,
>>   			struct drm_file *file, int to_id,
>>   			struct i915_hw_context **ctx);
>>   void i915_gem_context_free(struct kref *ctx_ref);
>> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
>> +				     struct drm_file *file,
>> +				     u32 id,
>> +				     struct ctx_reset_state **rs);
>>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>   				  struct drm_file *file);
>>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index cba54fb..1b14a06 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -304,6 +304,40 @@ static int context_idr_cleanup(int id, void *p, void *data)
>>   	return 0;
>>   }
>>
>> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
>> +				     struct drm_file *file,
>> +				     u32 id,
>> +				     struct ctx_reset_state **rs)
>> +{
>> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>> +	struct drm_i915_file_private *file_priv = file->driver_priv;
>> +	struct i915_hw_context *to;
>> +
>> +	if (dev_priv->hw_contexts_disabled)
>> +		return -ENOENT;
>> +
>> +	if (ring->id != RCS)
>> +		return -EINVAL;
>> +
>> +	if (rs == NULL)
>> +		return -EINVAL;
>> +
>> +	if (file == NULL)
>> +		return -EINVAL;
>> +
>> +	if (id == DEFAULT_CONTEXT_ID) {
>> +		*rs = &file_priv->reset_state;
>> +	} else {
>> +		to = i915_gem_context_get(file->driver_priv, id);
>> +		if (to == NULL)
>> +			return -ENOENT;
>> +
>> +		*rs = &to->reset_state;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
>>   {
>>   	struct drm_i915_file_private *file_priv = file->driver_priv;
>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-27  1:47   ` Ian Romanick
                       ` (2 preceding siblings ...)
  2013-02-27 10:11     ` Mika Kuoppala
@ 2013-02-27 17:20     ` Jesse Barnes
  3 siblings, 0 replies; 32+ messages in thread
From: Jesse Barnes @ 2013-02-27 17:20 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

On Tue, 26 Feb 2013 17:47:12 -0800
Ian Romanick <idr@freedesktop.org> wrote:

> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
> > For arb-robustness, every context needs to have its own
> > reset state tracking. The default context will be handled in the same
> > way as the no-context case further down in the patch set.
> > For the no-context case, the reset state will be stored in
> > the file_priv part.
> >
> > v2: handle default context inside get_reset_state
> 
> This isn't the interface we want.  I already sent you the patches for 
> Mesa, and you seem to have completely ignored them.  Moreover, this 
> interface provides no mechanism to query for its existence (other than 
> relying on the kernel version), and no method to deprecate it.

Before you get all wound up about this, note that you're replying to a
kernel internal function, not the ioctl that will be exposed to Mesa...

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-27  9:13     ` Chris Wilson
@ 2013-02-28  1:15       ` Ian Romanick
  2013-02-28 11:14         ` Chris Wilson
  0 siblings, 1 reply; 32+ messages in thread
From: Ian Romanick @ 2013-02-28  1:15 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx

On 02/27/2013 01:13 AM, Chris Wilson wrote:
> On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
>> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>>> For arb-robustness, every context needs to have its own
>>> reset state tracking. The default context will be handled in the same
>>> way as the no-context case further down in the patch set.
>>> For the no-context case, the reset state will be stored in
>>> the file_priv part.
>>>
>>> v2: handle default context inside get_reset_state
>>
>> This isn't the interface we want.  I already sent you the patches
>> for Mesa, and you seem to have completely ignored them.  Moreover,
>> this interface provides no mechanism to query for its existence
>> (other than relying on the kernel version), and no method to
>> deprecate it.
>
> Its existence, and the existence of any successor, is the only test you
> need to check for the interface. Testing the flag field up front also
> lets you know if individual flags are known prior to use.
>
>> Based on e-mail discussions, I think danvet agrees with me here.
>>
>> Putting guilty / innocent counting in kernel puts policy decisions
>> in the kernel that belong with the user space API that implements
>> them. Putting these choices in the kernel significantly decreases
>> how "future proof" the interface is.  Since any kernel/user
>> interface has to be kept forever, this is just asking for
>> maintenance trouble down the road.
>
> I think you have the policy standpoint reversed.

Can you elaborate?  I've been out of the kernel loop for a long time, 
but I always thought policy and mechanism were supposed to be separated. 
  In the case of drivers with a user-mode component, the mechanism was 
in the kernel (where else could it be?), and policy decisions were in 
user-mode.  This is often at odds with keeping the interfaces between 
kernel and user thin, so that may be where my misunderstanding is.

>> Also, it's difficult (for me) to tell from the rest of the series
>> whether or not a context that was not affected by a reset (had no
>> batches in flight at all) can still observe that a reset occurred.
>
> Yes, the context can observe that the global reset count increased while
> its own did not.

See below for a couple reasons why I think this may be a mistake.

>>  From the GL point of view, once a context has observed a reset, it's
>> garbage.  Knowing that it saw 1 reset or 43,000,000 resets is the
>> same.  Since it's a count, GL will have to query the values before
>> any rendering happens or it won't know whether the value 6 means
>> there was a reset or not.
>>
>> The right interface is a set of flags indicating the state the
>> context was in when it observed a reset.  As this patch series
>> stands, Mesa will not use this interface.
>
> This interface conforms to the specification and interface that you
> last posted to the list and that were revised on-list. There are two
> substantive differences; there is *less* policy in the kernel than
> before, and the breadcrumb of last reset state was overlooked.
>
> In order to make the reset status work in a multi-threaded environment
> the kernel exists in, we have to assume that there will be multiple
> actors on the reset status, and they will be comparing the state that
> they are interested in. The set of flags that GL understands is
> deducible from this interface.

We had some discussion on the list about a proposed interface last year. 
  We had a really good discussion, and both you and Daniel provided some 
really valuable feedback.  The biggest piece of feedback that both of 
you provided was to simplify.  I took that advice to heart.  The 
original interface was clunky and over complicated.

If memory serves, that discussion was on the order of 8 to 10 months 
ago.  Quite a lot has happened since then.  The biggest thing that 
happened is I started implementing the interface we discussed in the 
kernel, libdrm, and Mesa.  In that process I found a number of problems 
with even the simplified interface.  Back in September I used this new 
data to simplify the interface even further.  This allowed both the
kernel and the user-mode code to be much less complex.

Right about that same time the whole Mesa team got super busy.  First we 
had OpenGL 3.1, then we had OpenGL ES 3.0.  During that time all of my 
ARB_robustness work languished.  I didn't want to send another round of 
not-fully-baked ideas (or patches) to the list, so they just sat.  And 
sat.  And sat. :(

At FOSDEM earlier this month, I discussed this with Jesse.  I told him 
that there was approximately epsilon probability that I would get back 
to this work.  As a result, I was going to have to throw it over the 
wall to the kernel team.  At no point did he mention that anyone was 
already working on this.

About a week ago Daniel pulled me into a discussion with Mika about the 
ARB_robustness work.  I dug out my old code, paged back in the memory of 
the work from off-line storage, and provided a bunch of feedback to Mika 
and Daniel.  After I sent around my explanation of why I abandoned the 
counting interface and circulated my work, there was no response from 
Mika or anyone on that team.  I had a brief discussion with Daniel on 
IRC, and he seemed to agree with my sentiments and like the direction of 
my code.

Then this patch series appeared on the list, and here we are.

There are two requirements in the ARB_robustness extension (and 
additional layered extensions) that I believe cause problems with all of 
the proposed counting interfaces.

1. GL contexts that are not affected by a reset (i.e., didn't lose any 
objects, didn't lose any rendering, etc.) should not observe it.  This 
is the big one that all the browser vendors want.  If you have 5 tabs 
open, a reset caused by a malicious WebGL app in one tab shouldn't, if 
possible, force them to rebuild all of their state for all the other tabs.

2. After a GL context observes a reset, it is garbage.  It must be 
destroyed and a new one created.  This means we only need to know that a 
reset of any variety affected the context.

In addition, I'm concerned about having one GL context be able to 
observe, in any manner, resets that did not affect it.  Given that 
people have figured out how to get encryption keys by observing cache 
timings, it's not impossible for a malicious program to use information 
about reset counts to do something.  Leaving a potential gap like this 
in an extension that's used to improve security seems... like we're 
just asking for a Phoronix article.  We don't have any requirement to 
expose this information, so I don't think we should.

We also don't have any requirement to expose this functionality on 
anything pre-Sandy Bridge.  We may want, at some point, to expose it on 
Ironlake.  Based on that, it seems reasonable to tie the availability of 
reset notifications to hardware contexts.  Since we won't test it, Mesa 
certainly isn't going to expose it on 8xx or 915.

Based on this, a simplified interface became obvious:  for each hardware 
context, track a mask of kinds of resets that have affected that 
context.  I defined three bits in the mask:

1. The hw context had a batch executing when the reset occurred.

2. The hw context had a batch pending when the reset occurred. 
Hopefully in the future we could make most occurrences of this go away 
by re-queuing the batch.  It may also be possible to do this in 
user-mode, but that may be more difficult.

3. "Other."

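Roughly, in uapi terms, that would look something like this (the names
below are illustrative only, not necessarily the ones from my patch):

#define I915_CONTEXT_RESET_BATCH_ACTIVE		(1 << 0)
#define I915_CONTEXT_RESET_BATCH_PENDING	(1 << 1)
#define I915_CONTEXT_RESET_OTHER		(1 << 2)

struct drm_i915_context_reset_status {
	__u32 ctx_id;	/* in: hw context to query */
	__u32 flags;	/* out: mask of the bits above */
};
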
There's also the potential to add other bits to communicate information 
about lost buffers, textures, etc.  This could be used to create a 
layered extension that lets applications transfer data from the dead 
context to the new context.  Browsers may not be too interested in this, 
but I could imagine compositors or other kinds of applications 
benefiting.  We may never implement that, but it's easier to communicate 
this through a set of flags than through a set of counts.

The guilty / innocent counts in this patch series lose information.  We 
can probably reconstruct the current flags from the counts, but we would 
have to do some refactoring (or additional counting) to get other flags 
in the future.  All of this would make the kernel code more complex. 
Why reconstruct the data you want when you could just track that data in 
the first place?

Since the functionality is not available on all hardware, I also added a 
query to determine the availability.  This has the added benefit that, 
should this interface prove to be insufficient in the future, we could 
deprecate it by having the query always return false.  All software 
involved has to be able to handle the case where the interface is not 
available, so I don't think this should run awry of the rules about 
kernel/user interface breakage.  Please correct me if I'm wrong.
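
(For reference, the user-space side of that availability check is just a
getparam probe along these lines; apart from I915_PARAM_HAS_RESET_QUERY
from the WIP kernel patch below, the code is only illustrative:

#include <stdbool.h>
#include <xf86drm.h>
#include <i915_drm.h>

static bool has_reset_query(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = { .param = I915_PARAM_HAS_RESET_QUERY };

	gp.value = &value;
	/* older kernels reject the unknown param: treat as unavailable */
	if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) != 0)
		return false;

	return value != 0;
}
)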

The Mesa patches are in my robustness2 branch:

http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2

As Mika pointed out, the patch that makes use of the new kernel 
interface is:

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97

In addition,

http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c

changes to brw_context.c to use intel_get_boolean to determine the 
availability of the kernel query.  I'd really like to only expose 
__DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way to 
do that with the DRI extension mechanism.

The libdrm patches are in master of:

http://cgit.freedesktop.org/~idr/drm

I had attacked the problem from the opposite end from Mika.  My initial 
implementation always reported "other" because a lot of other work 
needed to happen to differentiate the other cases.  It's good to see 
that work progressing.  My really, really, really incomplete kernel 
patch is at:

http://people.freedesktop.org/~idr/robust.patch

This patch was against some kernel from early September 2012.  It may 
not apply or build at this point.  I believe I was able to observe a 
reset by having an app run a shader with an infinite loop and poll 
glGetGraphicsResetStatusARB.  I couldn't find that code, so I may be 
misremembering.  With the exception of the I915_PARAM_HAS_RESET_QUERY 
handling, nearly all of this patch is probably useless at this point.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-28  1:15       ` Ian Romanick
@ 2013-02-28 11:14         ` Chris Wilson
  2013-02-28 20:31           ` Ian Romanick
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Wilson @ 2013-02-28 11:14 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

On Wed, Feb 27, 2013 at 05:15:08PM -0800, Ian Romanick wrote:
> On 02/27/2013 01:13 AM, Chris Wilson wrote:
> >On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
> >>On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
> >>>For arb-robustness, every context needs to have its own
> >>>reset state tracking. The default context will be handled in the same
> >>>way as the no-context case further down in the patch set.
> >>>For the no-context case, the reset state will be stored in
> >>>the file_priv part.
> >>>
> >>>v2: handle default context inside get_reset_state
> >>
> >>This isn't the interface we want.  I already sent you the patches
> >>for Mesa, and you seem to have completely ignored them.  Moreover,
> >>this interface provides no mechanism to query for its existence
> >>(other than relying on the kernel version), and no method to
> >>deprecate it.
> >
> >Its existence, and the existence of any successor, is the only test you
> >need to check for the interface. Testing the flag field up front also
> >lets you know if individual flags are known prior to use.
> >
> >>Based on e-mail discussions, I think danvet agrees with me here.
> >>
> >>Putting guilty / innocent counting in kernel puts policy decisions
> >>in the kernel that belong with the user space API that implements
> >>them. Putting these choices in the kernel significantly decreases
> >>how "future proof" the interface is.  Since any kernel/user
> >>interface has to be kept forever, this is just asking for
> >>maintenance trouble down the road.
> >
> >I think you have the policy standpoint reversed.
> 
> Can you elaborate?  I've been out of the kernel loop for a long
> time, but I always thought policy and mechanism were supposed to be
> separated.  In the case of drivers with a user-mode component, the
> mechanism was in the kernel (where else could it be?), and policy
> decisions were in user-mode.  This is often at odds with keeping the
> interfaces between kernel and user thin, so that may be where my
> misunderstanding is.

Right, the idea is to keep policy out of the kernel. I disagree that
your suggestion succeeds in doing this.

[snip to details]

> There are two requirements in the ARB_robustness extension (and
> additional layered extensions) that I believe cause problems with
> all of the proposed counting interfaces.
> 
> 1. GL contexts that are not affected by a reset (i.e., didn't lose
> any objects, didn't lose any rendering, etc.) should not observe it.
> This is the big one that all the browser vendors want.  If you have
> 5 tabs open, a reset caused by a malicious WebGL app in one tab
> shouldn't, if possible, force them to rebuild all of their state for
> all the other tabs.
> 
> 2. After a GL context observes a reset, it is garbage.  It must be
> destroyed and a new one created.  This means we only need to know
> that a reset of any variety affected the context.

For me observes implies the ability for the context to inspect global
system state, whereas I think you mean if the context and its associated
state is affected by a reset.
 
> In addition, I'm concerned about having one GL context be able to
> observe, in any manner, resets that did not affect it.  Given that
> people have figured out how to get encryption keys by observing
> cache timings, it's not impossible for a malicious program to use
> information about reset counts to do something.  Leaving a potential
> gap like this in an extension that's used to improve security
> seems... like we're just asking for a Phoronix article.  We don't
> have any requirement to expose this information, so I don't think we
> should.

So we should not do minor+major pagefault tracking either? I only
suggested that you could read the global state because I thought you
implied you wanted it. From the display server POV, global state is the
most useful as I am more interested in whether I can reliably use the GPU
and prevent a DoS from a misbehaving set of clients.
 
> We also don't have any requirement to expose this functionality on
> anything pre-Sandy Bridge.  We may want, at some point, to expose it
> on Ironlake.  Based on that, it seems reasonable to tie the
> availability of reset notifications to hardware contexts.  Since we
> won't test it, Mesa certainly isn't going to expose it on 8xx or
> 915.

From a design standpoint hw contexts are just one function of the context
object and not the reverse. Ben has been rightfully arguing that we need
contexts in the kernel for far more than supporting logical hw contexts.

> Based on this, a simplified interface became obvious:  for each
> hardware context, track a mask of kinds of resets that have affected
> that context.  I defined three bits in the mask:

The problem is that you added an atomic-read-and-reset into the ioctl. I
objected to that policy when Mika presented it earlier as it prevents
concurrent accesses and so imposes an explicit usage model.

> 1. The hw context had a batch executing when the reset occurred.
> 
> 2. The hw context had a batch pending when the reset occurred.
> Hopefully in the future we could make most occurrences of this go
> away by re-queuing the batch.  It may also be possible to do this in
> user-mode, but that may be more difficult.
> 
> 3. "Other."
> 
> There's also the potential to add other bits to communicate
> information about lost buffers, textures, etc.  This could be used
> to create a layered extension that lets applications transfer data
> from the dead context to the new context.  Browsers may not be too
> interested in this, but I could imagine compositors or other kinds
> of applications benefiting.  We may never implement that, but it's
> easier to communicate this through a set of flags than through a set
> of counts.
> 
> The guilty / innocent counts in this patch series lose information.

Disagree. The flags represent a loss of information, as you explained
earlier.

> We can probably reconstruct the current flags from the counts, but
> we would have to do some refactoring (or additional counting) to get
> other flags in the future.  All of this would make the kernel code
> more complex. Why reconstruct the data you want when you could just
> track that data in the first place?

Because you are constructing a policy decision based on raw information.

> Since the functionality is not available on all hardware, I also
> added a query to determine the availability.  This has the added
> benefit that, should this interface prove to be insufficient in the
> future, we could deprecate it by having the query always return
> false.

That offers no more information than just probing the ioctl.

> All software involved has to be able to handle the case
> where the interface is not available, so I don't think this should
> run awry of the rules about kernel/user interface breakage.  Please
> correct me if I'm wrong.

[snip patch locations]

The biggest change you made was to reduce the counts to a set of flags,
to make it harder for an attacker to analyse the reset frequency.
Instead that attacker can just poll the flags and coarsely reconstruct
the counts... If that is their goal. Instead it transfers userspace
policy into the kernel, which is what we were seeking to avoid, right?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 09/13] drm/i915: add reset_state for hw_contexts
  2013-02-28 11:14         ` Chris Wilson
@ 2013-02-28 20:31           ` Ian Romanick
  0 siblings, 0 replies; 32+ messages in thread
From: Ian Romanick @ 2013-02-28 20:31 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx

On 02/28/2013 03:14 AM, Chris Wilson wrote:
> On Wed, Feb 27, 2013 at 05:15:08PM -0800, Ian Romanick wrote:
>> On 02/27/2013 01:13 AM, Chris Wilson wrote:
>>> On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote:
>>>> On 02/26/2013 03:05 AM, Mika Kuoppala wrote:
>>>>> For arb-robustness, every context needs to have its own
>>>>> reset state tracking. The default context will be handled in the same
>>>>> way as the no-context case further down in the patch set.
>>>>> For the no-context case, the reset state will be stored in
>>>>> the file_priv part.
>>>>>
>>>>> v2: handle default context inside get_reset_state
>>>>
>>>> This isn't the interface we want.  I already sent you the patches
>>>> for Mesa, and you seem to have completely ignored them.  Moreover,
>>>> this interface provides no mechanism to query for its existence
>>>> (other than relying on the kernel version), and no method to
>>>> deprecate it.
>>>
>>> Its existence, and the existence of any successor, is the only test you
>>> need to check for the interface. Testing the flag field up front also
>>> lets you know if individual flags are known prior to use.
>>>
>>>> Based on e-mail discussions, I think danvet agrees with me here.
>>>>
>>>> Putting guilty / innocent counting in kernel puts policy decisions
>>>> in the kernel that belong with the user space API that implements
>>>> them. Putting these choices in the kernel significantly decreases
>>>> how "future proof" the interface is.  Since any kernel/user
>>>> interface has to be kept forever, this is just asking for
>>>> maintenance trouble down the road.
>>>
>>> I think you have the policy standpoint reversed.
>>
>> Can you elaborate?  I've been out of the kernel loop for a long
>> time, but I always thought policy and mechanism were supposed to be
>> separated.  In the case of drivers with a user-mode component, the
>> mechanism was in the kernel (where else could it be?), and policy
>> decisions were in user-mode.  This is often at odds with keeping the
>> interfaces between kernel and user thin, so that may be where my
>> misunderstanding is.
>
> Right, the idea is to keep policy out of the kernel. I disagree that
> your suggestion succeeds in doing this.

I was trying to put the mechanism for determining what things happened 
in the kernel and the policy for interpreting what those things mean in 
user-mode.

Ignoring some bugs / misimplementation in my patch (which I try to 
address below), I thought I was doing that.  User-mode gets some 
information about the state it was in when the reset occurred.  It then 
uses that state to decide the value to return to the application for 
glGetGraphicsResetStatusARB.

After reading the bits below where I talk about other problems in the 
kernel code I posted, maybe you can address specific ways my proposal 
fails.  I want any glaring interface deficiencies fixed before any code 
goes in the kernel... I fully understand the pain, on both sides, of 
shipping a kernel interface only to have to replace it in a year.

> [snip to details]
>
>> There are two requirements in the ARB_robustness extension (and
>> additional layered extensions) that I believe cause problems with
>> all of the proposed counting interfaces.
>>
>> 1. GL contexts that are not affected by a reset (i.e., didn't lose
>> any objects, didn't lose any rendering, etc.) should not observe it.
>> This is the big one that all the browser vendors want.  If you have
>> 5 tabs open, a reset caused by a malicious WebGL app in one tab
>> shouldn't, if possible, force them to rebuild all of their state for
>> all the other tabs.
>>
>> 2. After a GL context observes a reset, it is garbage.  It must be
>> destroyed and a new one created.  This means we only need to know
>> that a reset of any variety affected the context.
>
> For me observes implies the ability for the context to inspect global
> system state, whereas I think you mean if the context and its associated
> state is affected by a reset.

I had thought about explicitly defining what I meant by a couple terms 
in my previous e-mail, but it was already getting quite long.  I guess I 
should have. :(  Here's what I meant:

A. "Affect the context" means the reset causes the context to lose some 
data.  It may be rendering that was queued (so framebuffer data is 
lost), it may be the contents of a buffer, or it may be something else. 
  Either way, the state of some data is not what the application expects 
it to be because of the reset.

B. "Observe the reset" means the GL context gets some data to know that 
the reset happened.  From the application point of view, this means 
glGetGraphicsResetStatusARB returns a value other than GL_NO_ERROR.

What I was trying to describe in point 1 of my previous message (above) 
is that B should occur if and only if A occurs.  Once B occurs, the 
application has to create a new context... recompile all of its shaders, 
reload all of its textures, reload all of its vertex data, etc.  We 
don't want to make apps do that any more often than absolutely necessary.
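
To make that concrete, the periodic check in a robust client boils down 
to something like this (the helpers are obviously app-specific, this is 
just a sketch):

	GLenum status = glGetGraphicsResetStatusARB();

	if (status != GL_NO_ERROR) {
		/* B has happened: this context is garbage from now on */
		destroy_gl_context(app);	/* app-specific */
		create_robust_gl_context(app);	/* app-specific: recompile
						 * shaders, reload textures
						 * and vertex data, ... */
	}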

>> In addition, I'm concerned about having one GL context be able to
>> observe, in any manner, resets that did not affect it.  Given that
>> people have figured out how to get encryption keys by observing
>> cache timings, it's not impossible for a malicious program to use
>> information about reset counts to do something.  Leaving a potential
>> gap like this in an extension that's used to improved security
>> seems... like we're just asking for a Phoronix article.  We don't
>> have any requirement to expose this information, so I don't think we
>> should.
>
> So we should not do minor+major pagefault tracking either? I only
> suggested that you could read the global state because I thought you
> implied you wanted it. From the display server POV, global state is the
> most useful as I am more interested in whether I can reliably use the GPU
> and prevent a DoS from a misbehaving set of clients.

In the X model, this would be easy to solve.  We would make the 
total-number-of-resets query only available to the DRM master.  As we 
move away from that model, it becomes more difficult.

It seems that the usage is different enough that having separate queries 
for the global reset count and the per-context resets feels right.

That still leaves the potential information leakage issue, and I'm 
honestly not sure what the right answer is there.  Perhaps we could add 
that query after we've gotten some feedback from an actual security 
expert.  There may be other options that we haven't considered yet. 
Maybe something like advertising the GPU reset "load average."  Dunno.

>> We also don't have any requirement to expose this functionality on
>> anything pre-Sandy Bridge.  We may want, at some point, to expose it
>> on Ironlake.  Based on that, it seems reasonable to tie the
>> availability of reset notifications to hardware contexts.  Since we
>> won't test it, Mesa certainly isn't going to expose it on 8xx or
>> 915.
>
> From a design standpoint hw contexts are just one function of the context
> object and not the reverse. Ben has been rightfully arguing that we need
> contexts in the kernel for far more than supporting logical hw contexts.

That's a fair point, and I also tend to agree with Ben.

I think it may also be an orthogonal issue.  As things currently stand 
(at least as far as I can tell) kernel contexts are tied to logical hw 
contexts.  Since we don't have a requirement for reset query support on 
older GPUs, it doesn't seem to make sense to put extra code in the 
kernel, that nobody will use, to do it.  If at some point the linkage 
between kernel contexts and hw contexts changes, and we can get a bunch 
of this mostly for free, it will make sense to add the support then.

Every single time I've written "future" code like that, it has either 
been broken or it turns out to not be what I need when the future 
arrives.  I know I'm not the only one with that experience.

>> Based on this, a simplified interface became obvious:  for each
>> hardware context, track a mask of kinds of resets that have affected
>> that context.  I defined three bits in the mask:
>
> The problem is that you added an atomic-read-and-reset into the ioctl. I
> objected to that policy when Mika presented it earlier as it prevents
> concurrent accesses and so imposes an explicit usage model.

I may have mentioned that my kernel patch was very early work and should 
be taken with a grain of salt. :)  You probably noticed that the patch 
was generated by diff and not by git-format-patch.  I hadn't even 
committed any of that to my local tree.  Given its shabby state, I was 
reluctant to even send it out.  However, if I'm going to be critical of 
another person's work, I have a responsibility to be fully open.

I too came to the conclusion that the atomic-read-and-reset part was a 
bad idea... I had forgotten that I even coded it that way, or I would 
have mentioned that in my previous message.

The other thing in my kernel patch that I think is a bad idea (as you 
note below) is that processes can query the reset state of HW contexts 
associated with different fds.  That coupled with the 
atomic-read-and-reset is completely disastrous.  I think this should be 
tightened up for all the reasons we've both mentioned.
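
Concretely, I'd expect the kernel side to resolve the context id through 
the caller's own file_priv, so a process can only ever see its own 
contexts.  Something like this (sketch only, names approximate):

	static struct i915_hw_context *
	get_own_context(struct drm_file *file, u32 ctx_id)
	{
		struct drm_i915_file_private *file_priv = file->driver_priv;

		/* only contexts created on this fd live in this idr */
		return idr_find(&file_priv->context_idr, ctx_id);
	}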

>> 1. The hw context had a batch executing when the reset occurred.
>>
>> 2. The hw context had a batch pending when the reset occurred.
>> Hopefully in the future we could make most occurrences of this go
>> away by re-queuing the batch.  It may also be possible to do this in
>> user-mode, but that may be more difficult.
>>
>> 3. "Other."
>>
>> There's also the potential to add other bits to communicate
>> information about lost buffers, textures, etc.  This could be used
>> to create a layered extension that lets applications transfer data
>> from the dead context to the new context.  Browsers may not be too
>> interested in this, but I could imagine compositors or other kinds
>> of applications benefiting.  We may never implement that, but it's
>> easier to communicate this through a set of flags than through a set
>> of counts.
>>
>> The guilty / innocent counts in this patch series lose information.
>
> Disagree. The flags represent a loss of information, as you explained
> earlier.

With the counts in this series, user-mode can't tell the difference 
between a reset that didn't affect the context and one that did.

Once a GL context observes a reset (i.e., when 
glGetGraphicsResetStatusARB returns non-GL_NO_ERROR), it is dead.  The 
only numbers that matter in that case are zero and not-zero.  If we 
assume that applications using this interface will query resets on a 
periodic basis, and both the major Linux browser vendors have told me 
that they do, the count will either be zero or some very small number. 
Given that, there's very little information lost, and the information 
that is lost is not important to the consumers of that information.
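
In other words, the whole user-space consumption of that information 
reduces to roughly this (the flag names are invented, only the shape 
matters):

	static GLenum reset_status_from_flags(uint32_t flags)
	{
		if (flags & RESET_BATCH_ACTIVE)		/* invented name */
			return GL_GUILTY_CONTEXT_RESET_ARB;
		if (flags & RESET_BATCH_PENDING)	/* invented name */
			return GL_INNOCENT_CONTEXT_RESET_ARB;
		if (flags & RESET_OTHER)		/* invented name */
			return GL_UNKNOWN_CONTEXT_RESET_ARB;
		return GL_NO_ERROR;
	}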

Now, this isn't the case with the display manager use case.  That case 
clearly wants counts, and it wants counts over time.  Like I said above, 
I think this is a different enough use case that having a separate 
mechanism may be the right approach.

>> We can probably reconstruct the current flags from the counts, but
>> we would have to do some refactoring (or additional counting) to get
>> other flags in the future.  All of this would make the kernel code
>> more complex. Why reconstruct the data you want when you could just
>> track that data in the first place?
>
> Because you are constructing a policy decision based on raw information.
>
>> Since the functionality is not available on all hardware, I also
>> added a query to determine the availability.  This has the added
>> benefit that, should this interface prove to be insufficient in the
>> future, we could deprecate it by having the query always return
>> false.
>
> That offers no more information than just probing the ioctl.

There are a couple of other bits of kernel functionality that Mesa 
queries using this method before using them (e.g., GEN7 SOL reset).  
Since this was, 
as far as I could tell, similarly querying kernel support for some 
hardware functionality, I assumed it should have a similar sort of 
query.  In the future, what metric should I use to determine which is which?
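
For reference, this is the pattern Mesa already uses for those probes; I 
assumed the reset query would grow a similar getparam (the parameter 
name below is made up):

	struct drm_i915_getparam gp;
	int has_reset_query = 0;

	memset(&gp, 0, sizeof(gp));
	gp.param = I915_PARAM_HAS_CONTEXT_RESET_QUERY;	/* made-up param */
	gp.value = &has_reset_query;

	if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) != 0)
		has_reset_query = 0;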

Also, what does "probing the ioctl" mean exactly?  Just call it and see 
whether you get ENOSYS?  Doesn't that run afoul of the requirement to 
never remove interfaces?

>> All software involved has to be able to handle the case
>> where the interface is not available, so I don't think this should
>> run awry of the rules about kernel/user interface breakage.  Please
>> correct me if I'm wrong.
>
> [snip patch locations]
>
> The biggest change you made was to reduce the counts to a set of flags,
> to make it harder for an attacker to analyse the reset frequency.
> Instead that attacker can just poll the flags and coarsely reconstruct
> the counts... If that is their goal. Instead it transfers userspace
> policy into the kernel, which is what we were seeking to avoid, right?
> -Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle
  2013-02-26 22:41   ` Ben Widawsky
@ 2013-03-01 13:49     ` Mika Kuoppala
  0 siblings, 0 replies; 32+ messages in thread
From: Mika Kuoppala @ 2013-03-01 13:49 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

Ben Widawsky <ben@bwidawsk.net> writes:

> On Tue, Feb 26, 2013 at 01:05:06PM +0200, Mika Kuoppala wrote:
>> In preparation for the next commit, pass seqno as a parameter
>> to i915_hangcheck_ring_idle as it will be used inside
>> i915_hangcheck_elapsed.
>> 
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_irq.c |   11 +++++++----
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index 29037e0..4f60c87 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -1740,11 +1740,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
>>  			  struct drm_i915_gem_request, list)->seqno;
>>  }
>>  
>> -static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
>> +static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
>> +				     u32 ring_seqno, bool *err)
>>  {
>>  	if (list_empty(&ring->request_list) ||
>> -	    i915_seqno_passed(ring->get_seqno(ring, false),
>> -			      ring_last_seqno(ring))) {
>> +	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
>>  		/* Issue a wake-up to catch stuck h/w. */
>>  		if (waitqueue_active(&ring->irq_queue)) {
>>  			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
>
> I think just return err. Don't bother with a bool and *err return.

They have different meanings here. Or do you suggest that I return an
int with the error and encode waiting/non-waiting in the return value?
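
Something like this, perhaps (untested sketch):

	enum ring_idle_state {
		RING_BUSY = 0,
		RING_IDLE,
		RING_IDLE_BUT_WAITERS,	/* the old *err = true case */
	};

	static enum ring_idle_state
	i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, u32 seqno)
	{
		if (!list_empty(&ring->request_list) &&
		    !i915_seqno_passed(seqno, ring_last_seqno(ring)))
			return RING_BUSY;

		return waitqueue_active(&ring->irq_queue) ?
			RING_IDLE_BUT_WAITERS : RING_IDLE;
	}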

-Mika


>> @@ -1821,7 +1821,10 @@ void i915_hangcheck_elapsed(unsigned long data)
>>  	memset(acthd, 0, sizeof(acthd));
>>  	idle = true;
>>  	for_each_ring(ring, dev_priv, i) {
>> -	    idle &= i915_hangcheck_ring_idle(ring, &err);
>> +		u32 seqno;
>> +
>> +		seqno = ring->get_seqno(ring, false);
>> +		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
>>  	    acthd[i] = intel_ring_get_active_head(ring);
>>  	}
>>  
> -- 
> Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-03-01 13:49 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-26 11:05 [PATCH 00/13] arb robustness enablers Mika Kuoppala
2013-02-26 11:05 ` [PATCH 01/13] drm/i915: add context parameter to i915_switch_context() Mika Kuoppala
2013-02-26 18:31   ` Ben Widawsky
2013-02-26 11:05 ` [PATCH 02/13] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
2013-02-26 21:45   ` Ben Widawsky
2013-02-26 11:05 ` [PATCH 03/13] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
2013-02-26 22:41   ` Ben Widawsky
2013-03-01 13:49     ` Mika Kuoppala
2013-02-26 11:05 ` [PATCH 04/13] drm/i915: track ring progression using seqnos Mika Kuoppala
2013-02-26 14:12   ` Chris Wilson
2013-02-26 15:09     ` Mika Kuoppala
2013-02-26 15:31       ` Chris Wilson
2013-02-26 11:05 ` [PATCH 05/13] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
2013-02-26 23:03   ` Ben Widawsky
2013-02-26 11:05 ` [PATCH 06/13] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
2013-02-26 14:16   ` Chris Wilson
2013-02-26 11:05 ` [PATCH 07/13] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
2013-02-26 11:05 ` [PATCH 08/13] drm/i915: add struct ctx_reset_state Mika Kuoppala
2013-02-26 11:05 ` [PATCH 09/13] drm/i915: add reset_state for hw_contexts Mika Kuoppala
2013-02-27  1:47   ` Ian Romanick
2013-02-27  1:50     ` Ian Romanick
2013-02-27  9:13     ` Chris Wilson
2013-02-28  1:15       ` Ian Romanick
2013-02-28 11:14         ` Chris Wilson
2013-02-28 20:31           ` Ian Romanick
2013-02-27 10:11     ` Mika Kuoppala
2013-02-27 17:20     ` Jesse Barnes
2013-02-26 11:05 ` [PATCH 10/13] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
2013-02-26 11:05 ` [PATCH 11/13] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
2013-02-26 14:22   ` Chris Wilson
2013-02-26 11:05 ` [PATCH 12/13] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
2013-02-26 11:05 ` [PATCH 13/13] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
