All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/16] arb robustness enablers v2
@ 2013-03-14 15:52 Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
                   ` (15 more replies)
  0 siblings, 16 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

Hi,

Reworked patchset for guilty context detection.
Main changes since the last posting:
- i915_add_request cleanup
- i915_switch_context returns ERR_PTR
- 3 seconds to declare hang regardless if guilty was found
- semaphore kicking for hung rings (from Chris)
- test case for the new interface
- interface is now included to have common base for discussions.

Thank you all who provided feedback.

The tree can be found here:
https://github.com/mkuoppal/linux/commits/arb-robustness

And testcase for i-g-t here:
https://github.com/mkuoppal/intel-gpu-tools/tree/arb-robustness

-Mika

Chris Wilson (1):
  drm/i915: Resurrect ring kicking for semaphores, selectively

Mika Kuoppala (15):
  drm/i915: return context from i915_switch_context()
  drm/i915: cleanup i915_add_request
  drm/i915: reference count for i915_hw_contexts
  drm/i915: pass seqno to i915_hangcheck_ring_idle
  drm/i915: track ring progression using seqnos
  drm/i915: introduce i915_hangcheck_ring_hung
  drm/i915: detect hang using per ring hangcheck_score
  drm/i915: remove i915_hangcheck_hung
  drm/i915: add struct ctx_reset_state
  drm/i915: add reset_state for hw_contexts
  drm/i915: add batch object and context to i915_add_request()
  drm/i915: mark rings which were waiting when hang happened
  drm/i915: find guilty batch buffer on ring resets
  drm/i915: refuse to submit more batchbuffers from guilty context
  drm/i915: add i915_gem_context_get_reset_status_ioctl

 drivers/gpu/drm/i915/i915_dma.c            |    5 +-
 drivers/gpu/drm/i915/i915_drv.c            |   84 +++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h            |   57 +++++++++--
 drivers/gpu/drm/i915/i915_gem.c            |  130 ++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_gem_context.c    |   97 +++++++++++++++---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   24 ++++-
 drivers/gpu/drm/i915/i915_irq.c            |  151 ++++++++++++++++------------
 drivers/gpu/drm/i915/intel_overlay.c       |    4 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h    |    4 +
 include/uapi/drm/i915_drm.h                |   28 ++++++
 11 files changed, 481 insertions(+), 105 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 01/16] drm/i915: return context from i915_switch_context()
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

In preparation for the next commit, return context that
was switched to from i915_switch_context().

v2: context in return value instead of param. (Ben Widawsky)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    5 +++--
 drivers/gpu/drm/i915/i915_gem.c            |    8 ++++---
 drivers/gpu/drm/i915/i915_gem_context.c    |   32 +++++++++++++++++-----------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++--
 4 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ca6b215..ccac8ea 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1669,8 +1669,9 @@ struct dma_buf *i915_gem_prime_export(struct drm_device *dev,
 void i915_gem_context_init(struct drm_device *dev);
 void i915_gem_context_fini(struct drm_device *dev);
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
-int i915_switch_context(struct intel_ring_buffer *ring,
-			struct drm_file *file, int to_id);
+struct i915_hw_context * __must_check
+i915_switch_context(struct intel_ring_buffer *ring,
+		    struct drm_file *file, int to_id);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1417fc6..93b8e05 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2517,9 +2517,11 @@ int i915_gpu_idle(struct drm_device *dev)
 
 	/* Flush everything onto the inactive list. */
 	for_each_ring(ring, dev_priv, i) {
-		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
-		if (ret)
-			return ret;
+		struct i915_hw_context *ctx;
+
+		ctx = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
 
 		ret = intel_ring_idle(ring);
 		if (ret)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 21177d9..258e5e0 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -428,41 +428,47 @@ static int do_switch(struct i915_hw_context *to)
 /**
  * i915_switch_context() - perform a GPU context switch.
  * @ring: ring for which we'll execute the context switch
- * @file_priv: file_priv associated with the context, may be NULL
- * @id: context id number
- * @seqno: sequence number by which the new context will be switched to
- * @flags:
+ * @file: file associated with the context, may be NULL
+ * @to_id: context id number
+ * @return: context that was switched to or ERR_PTR on error
  *
  * The context life cycle is simple. The context refcount is incremented and
  * decremented by 1 and create and destroy. If the context is in use by the GPU,
  * it will have a refoucnt > 1. This allows us to destroy the context abstract
  * object while letting the normal object tracking destroy the backing BO.
  */
-int i915_switch_context(struct intel_ring_buffer *ring,
-			struct drm_file *file,
-			int to_id)
+
+struct i915_hw_context *
+i915_switch_context(struct intel_ring_buffer *ring,
+		    struct drm_file *file,
+		    int to_id)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct i915_hw_context *to;
+	struct i915_hw_context *to = NULL;
+	int ret = 0;
 
 	if (dev_priv->hw_contexts_disabled)
-		return 0;
+		return NULL;
 
 	if (ring != &dev_priv->ring[RCS])
-		return 0;
+		return NULL;
 
 	if (to_id == DEFAULT_CONTEXT_ID) {
 		to = ring->default_context;
 	} else {
 		if (file == NULL)
-			return -EINVAL;
+			return ERR_PTR(-EINVAL);
 
 		to = i915_gem_context_get(file->driver_priv, to_id);
 		if (to == NULL)
-			return -ENOENT;
+			return ERR_PTR(-ENOENT);
 	}
 
-	return do_switch(to);
+	ret = do_switch(to);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return to;
 }
 
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index ea963c3..ad7029e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -838,6 +838,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_i915_gem_object *batch_obj;
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
+	struct i915_hw_context *ctx;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1020,9 +1021,11 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
-	ret = i915_switch_context(ring, file, ctx_id);
-	if (ret)
+	ctx = i915_switch_context(ring, file, ctx_id);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
 		goto err;
+	}
 
 	if (ring == &dev_priv->ring[RCS] &&
 	    mode != dev_priv->relative_constants_mode) {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 02/16] drm/i915: cleanup i915_add_request
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15  9:43   ` Chris Wilson
  2013-03-14 15:52 ` [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

Only execbuffer needs all the parameters. Cleanup everything
else behind macro.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    8 +++++---
 drivers/gpu/drm/i915/i915_gem.c            |   10 +++++-----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    2 +-
 drivers/gpu/drm/i915/intel_overlay.c       |    4 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 5 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ccac8ea..930b0ac 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1626,9 +1626,11 @@ void i915_gem_init_ppgtt(struct drm_device *dev);
 void i915_gem_cleanup_ringbuffer(struct drm_device *dev);
 int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
-int i915_add_request(struct intel_ring_buffer *ring,
-		     struct drm_file *file,
-		     u32 *seqno);
+int i915_do_add_request(struct intel_ring_buffer *ring,
+			u32 *seqno,
+			struct drm_file *file);
+#define i915_add_request(ring, seqno) \
+	i915_do_add_request(ring, seqno, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 93b8e05..105ad4b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -967,7 +967,7 @@ i915_gem_check_olr(struct intel_ring_buffer *ring, u32 seqno)
 
 	ret = 0;
 	if (seqno == ring->outstanding_lazy_request)
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL);
 
 	return ret;
 }
@@ -1995,9 +1995,9 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 }
 
 int
-i915_add_request(struct intel_ring_buffer *ring,
-		 struct drm_file *file,
-		 u32 *out_seqno)
+i915_do_add_request(struct intel_ring_buffer *ring,
+		    u32 *out_seqno,
+		    struct drm_file *file)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2262,7 +2262,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
 		if (ring->gpu_caches_dirty)
-			i915_add_request(ring, NULL, NULL);
+			i915_add_request(ring, NULL);
 
 		idle &= list_empty(&ring->request_list);
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index ad7029e..013aca6 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -799,7 +799,7 @@ i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_add_request(ring, file, NULL);
+	(void)i915_do_add_request(ring, NULL, file);
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
index 67a2501..40e509c 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -217,7 +217,7 @@ static int intel_overlay_do_wait_request(struct intel_overlay *overlay,
 	int ret;
 
 	BUG_ON(overlay->last_flip_req);
-	ret = i915_add_request(ring, NULL, &overlay->last_flip_req);
+	ret = i915_add_request(ring, &overlay->last_flip_req);
 	if (ret)
 		return ret;
 
@@ -286,7 +286,7 @@ static int intel_overlay_continue(struct intel_overlay *overlay,
 	intel_ring_emit(ring, flip_addr);
 	intel_ring_advance(ring);
 
-	return i915_add_request(ring, NULL, &overlay->last_flip_req);
+	return i915_add_request(ring, &overlay->last_flip_req);
 }
 
 static void intel_overlay_release_old_vid_tail(struct intel_overlay *overlay)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 1d5d613..9584bc1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1423,7 +1423,7 @@ int intel_ring_idle(struct intel_ring_buffer *ring)
 
 	/* We need to add any requests required to flush the objects and ring */
 	if (ring->outstanding_lazy_request) {
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL);
 		if (ret)
 			return ret;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15  9:44   ` Chris Wilson
  2013-04-02 22:45   ` [PATCH 1/2] [v3] " Ben Widawsky
  2013-03-14 15:52 ` [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively Mika Kuoppala
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

In preparation to do analysis of which context was
guilty of gpu hung, store kreffed context pointer
into request struct.

This allows us to inspect contexts when gpu is reset
even if those contexts would already be released
by userspace.

v2: track i915_hw_context pointers instead of using ctx_ids
    (from Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |   10 ++++++++--
 drivers/gpu/drm/i915/i915_gem.c            |   16 +++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.c    |   22 ++++++++++++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++---
 4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 930b0ac..79cd27f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -437,6 +437,7 @@ struct i915_hw_ppgtt {
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
 struct i915_hw_context {
+	struct kref ref;
 	int id;
 	bool is_initialized;
 	struct drm_i915_file_private *file_priv;
@@ -1239,6 +1240,9 @@ struct drm_i915_gem_request {
 	/** Postion in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Context related to this request */
+	struct i915_hw_context *ctx;
+
 	/** Time at which this request was emitted, in jiffies. */
 	unsigned long emitted_jiffies;
 
@@ -1628,9 +1632,10 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
 int i915_do_add_request(struct intel_ring_buffer *ring,
 			u32 *seqno,
-			struct drm_file *file);
+			struct drm_file *file,
+			struct i915_hw_context *ctx);
 #define i915_add_request(ring, seqno) \
-	i915_do_add_request(ring, seqno, NULL)
+	i915_do_add_request(ring, seqno, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -1674,6 +1679,7 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 struct i915_hw_context * __must_check
 i915_switch_context(struct intel_ring_buffer *ring,
 		    struct drm_file *file, int to_id);
+void i915_gem_context_free(struct kref *ctx_ref);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 105ad4b..e905086 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1997,7 +1997,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 int
 i915_do_add_request(struct intel_ring_buffer *ring,
 		    u32 *out_seqno,
-		    struct drm_file *file)
+		    struct drm_file *file,
+		    struct i915_hw_context *ctx)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2037,6 +2038,11 @@ i915_do_add_request(struct intel_ring_buffer *ring,
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
 	request->tail = request_ring_position;
+	request->ctx = ctx;
+
+	if (request->ctx)
+		kref_get(&request->ctx->ref);
+
 	request->emitted_jiffies = jiffies;
 	was_empty = list_empty(&ring->request_list);
 	list_add_tail(&request->list, &ring->request_list);
@@ -2101,6 +2107,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
@@ -2195,6 +2205,10 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 258e5e0..8fb4d3c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -124,12 +124,24 @@ static int get_context_size(struct drm_device *dev)
 	return ret;
 }
 
-static void do_destroy(struct i915_hw_context *ctx)
+void i915_gem_context_free(struct kref *ctx_ref)
+{
+	struct i915_hw_context *ctx = container_of(ctx_ref,
+						   typeof(*ctx), ref);
+	kfree(ctx);
+}
+
+static void do_release(struct i915_hw_context *ctx)
 {
 	if (ctx->file_priv)
 		idr_remove(&ctx->file_priv->context_idr, ctx->id);
 
 	drm_gem_object_unreference(&ctx->obj->base);
+}
+
+static void do_destroy(struct i915_hw_context *ctx)
+{
+	do_release(ctx);
 	kfree(ctx);
 }
 
@@ -145,6 +157,7 @@ create_hw_context(struct drm_device *dev,
 	if (ctx == NULL)
 		return ERR_PTR(-ENOMEM);
 
+	kref_init(&ctx->ref);
 	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
 	if (ctx->obj == NULL) {
 		kfree(ctx);
@@ -286,8 +299,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 
 	BUG_ON(id == DEFAULT_CONTEXT_ID);
 
-	do_destroy(ctx);
-
+	do_release(ctx);
+	kref_put(&ctx->ref, i915_gem_context_free);
 	return 0;
 }
 
@@ -522,7 +535,8 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
-	do_destroy(ctx);
+	do_release(ctx);
+	kref_put(&ctx->ref, i915_gem_context_free);
 
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 013aca6..1b524e7 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -793,13 +793,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
 static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
-				    struct intel_ring_buffer *ring)
+				    struct intel_ring_buffer *ring,
+				    struct i915_hw_context *ctx)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_do_add_request(ring, NULL, file);
+	(void)i915_do_add_request(ring, NULL, file, ctx);
 }
 
 static int
@@ -1074,7 +1075,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
 
 err:
 	eb_destroy(eb);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (2 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-17 21:53   ` Daniel Vetter
  2013-03-14 15:52 ` [PATCH v2 05/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky, miku

From: Chris Wilson <chris@chris-wilson.co.uk>

Once we thought we got semaphores working, we disabled kicking the ring
if hangcheck fired whilst waiting upon a ring as it was doing more harm
than good:

commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Dec 14 13:56:58 2011 +0100

    drm/i915: kicking rings stuck on semaphores considered harmful

However, life is never that easy and semaphores are still causing
problems whereby the value written by one ring (bcs) is not being
propagated to the waiter (rcs). Thus the waiter never wakes up and we
declare the GPU hung, which often has unfortunate consequences, even if
we successfully reset the GPU.

But the GPU is idle as it has completed the work, just didn't notify its
clients. So we can detect the incomplete wait during hang check and
probe the target ring to see if has indeed emitted the breadcrumb seqno
following the work and then and only then kick the waiter.

Based on a suggestion by Ben Widawsky.

v2: cross-check wait with iphdr. fix signaller calculation.

References: https://bugs.freedesktop.org/show_bug.cgi?id=54226
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Ben Widawsky <ben@bwidawsk.net>
Acked-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_irq.c |   40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 2139714..ce7efa8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1787,6 +1787,37 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
 	return false;
 }
 
+static bool semaphore_passed(struct intel_ring_buffer *ring)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	u32 acthd = intel_ring_get_active_head(ring) & HEAD_ADDR;
+	struct intel_ring_buffer *signaller;
+	u32 cmd, ipehr, acthd_min;
+
+	ipehr = I915_READ(RING_IPEHR(ring->mmio_base));
+	if ((ipehr & ~(0x3 << 16)) !=
+	    (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER))
+		return false;
+
+	/* ACTHD is likely pointing to the dword after the actual command,
+	 * so scan backwards until we find the MBOX.
+	 */
+	acthd_min = max((int)acthd - 3 * 4, 0);
+	do {
+		cmd = ioread32(ring->virtual_start + acthd);
+		if (cmd == ipehr)
+			break;
+
+		acthd -= 4;
+		if (acthd < acthd_min)
+			return false;
+	} while (1);
+
+	signaller = &dev_priv->ring[(ring->id + (((ipehr >> 17) & 1) + 1)) % 3];
+	return i915_seqno_passed(signaller->get_seqno(signaller, false),
+				 ioread32(ring->virtual_start+acthd+4)+1);
+}
+
 static bool kick_ring(struct intel_ring_buffer *ring)
 {
 	struct drm_device *dev = ring->dev;
@@ -1798,6 +1829,15 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 		I915_WRITE_CTL(ring, tmp);
 		return true;
 	}
+
+	if (INTEL_INFO(dev)->gen >= 6 &&
+	    tmp & RING_WAIT_SEMAPHORE &&
+	    semaphore_passed(ring)) {
+		DRM_ERROR("Kicking stuck semaphore on %s\n",
+			  ring->name);
+		I915_WRITE_CTL(ring, tmp);
+		return true;
+	}
 	return false;
 }
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 05/16] drm/i915: pass seqno to i915_hangcheck_ring_idle
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (3 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 06/16] drm/i915: track ring progression using seqnos Mika Kuoppala
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

In preparation for next commit, pass seqno as a parameter
to i915_hangcheck_ring_idle as it will be used inside
i915_hangcheck_elapsed.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index ce7efa8..ed8ac1f 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1770,11 +1770,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
 			  struct drm_i915_gem_request, list)->seqno;
 }
 
-static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
+static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
+				     u32 ring_seqno, bool *err)
 {
 	if (list_empty(&ring->request_list) ||
-	    i915_seqno_passed(ring->get_seqno(ring, false),
-			      ring_last_seqno(ring))) {
+	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
 		/* Issue a wake-up to catch stuck h/w. */
 		if (waitqueue_active(&ring->irq_queue)) {
 			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
@@ -1891,7 +1891,10 @@ void i915_hangcheck_elapsed(unsigned long data)
 	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-	    idle &= i915_hangcheck_ring_idle(ring, &err);
+		u32 seqno;
+
+		seqno = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
 	    acthd[i] = intel_ring_get_active_head(ring);
 	}
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 06/16] drm/i915: track ring progression using seqnos
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (4 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 05/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15  9:48   ` Chris Wilson
  2013-03-14 15:52 ` [PATCH v2 07/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

Instead of relying in acthd, track ring seqno progression
to detect if ring has hung.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 --
 drivers/gpu/drm/i915/i915_irq.c         |   30 +++++++++++++-----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    2 ++
 3 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 79cd27f..534e673 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -797,8 +797,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
-	uint32_t last_acthd[I915_NUM_RINGS];
-	uint32_t prev_instdone[I915_NUM_INSTDONE_REG];
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index ed8ac1f..6594c97 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1880,22 +1880,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	uint32_t acthd[I915_NUM_RINGS], instdone[I915_NUM_INSTDONE_REG];
 	struct intel_ring_buffer *ring;
 	bool err = false, idle;
 	int i;
+	u32 seqno[I915_NUM_RINGS];
+	bool work_done;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		u32 seqno;
-
-		seqno = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
-	    acthd[i] = intel_ring_get_active_head(ring);
+		seqno[i] = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
 	}
 
 	/* If all work is done then ACTHD clearly hasn't advanced. */
@@ -1911,20 +1908,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 		return;
 	}
 
-	i915_get_extra_instdone(dev, instdone);
-	if (memcmp(dev_priv->gpu_error.last_acthd, acthd,
-		   sizeof(acthd)) == 0 &&
-	    memcmp(dev_priv->gpu_error.prev_instdone, instdone,
-		   sizeof(instdone)) == 0) {
+	work_done = false;
+	for_each_ring(ring, dev_priv, i) {
+		if (ring->hangcheck_seqno != seqno[i]) {
+			work_done = true;
+			ring->hangcheck_seqno = seqno[i];
+		}
+	}
+
+	if (!work_done) {
 		if (i915_hangcheck_hung(dev))
 			return;
 	} else {
 		dev_priv->gpu_error.hangcheck_count = 0;
-
-		memcpy(dev_priv->gpu_error.last_acthd, acthd,
-		       sizeof(acthd));
-		memcpy(dev_priv->gpu_error.prev_instdone, instdone,
-		       sizeof(instdone));
 	}
 
 repeat:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d66208c..9599c56 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -137,6 +137,8 @@ struct  intel_ring_buffer {
 	struct i915_hw_context *default_context;
 	struct drm_i915_gem_object *last_context_obj;
 
+	u32 hangcheck_seqno;
+
 	void *private;
 };
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 07/16] drm/i915: introduce i915_hangcheck_ring_hung
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (5 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 06/16] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 08/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

In preparation to track per ring progress in hangcheck,
add i915_hangcheck_ring_hung.

v2: omit dev parameter (Ben Widawsky)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 6594c97..5ba7f19 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1841,28 +1841,33 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	return false;
 }
 
+static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
+{
+	if (IS_GEN2(ring->dev))
+		return false;
+
+	/* Is the chip hanging on a WAIT_FOR_EVENT?
+	 * If so we can simply poke the RB_WAIT bit
+	 * and break the hang. This should work on
+	 * all but the second generation chipsets.
+	 */
+	return !kick_ring(ring);
+}
+
 static bool i915_hangcheck_hung(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 
 	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
 		bool hung = true;
+		struct intel_ring_buffer *ring;
+		int i;
 
 		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
 		i915_handle_error(dev, true);
 
-		if (!IS_GEN2(dev)) {
-			struct intel_ring_buffer *ring;
-			int i;
-
-			/* Is the chip hanging on a WAIT_FOR_EVENT?
-			 * If so we can simply poke the RB_WAIT bit
-			 * and break the hang. This should work on
-			 * all but the second generation chipsets.
-			 */
-			for_each_ring(ring, dev_priv, i)
-				hung &= !kick_ring(ring);
-		}
+		for_each_ring(ring, dev_priv, i)
+			hung &= i915_hangcheck_ring_hung(ring);
 
 		return hung;
 	}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 08/16] drm/i915: detect hang using per ring hangcheck_score
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (6 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 07/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 09/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

Add per ring score of possible culprit for gpu hang. If
ring is busy and not waiting, it will get the highest score
across calls to i915_hangcheck_elapsed. This way we are
most likely to find the ring that caused the hang among
the waiting ones.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |   63 ++++++++++++++++---------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5ba7f19..dee4557 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -360,7 +360,6 @@ static void notify_ring(struct drm_device *dev,
 
 	wake_up_all(&ring->irq_queue);
 	if (i915_enable_hangcheck) {
-		dev_priv->gpu_error.hangcheck_count = 0;
 		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
 			  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
 	}
@@ -1886,52 +1885,56 @@ void i915_hangcheck_elapsed(unsigned long data)
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct intel_ring_buffer *ring;
-	bool err = false, idle;
 	int i;
-	u32 seqno[I915_NUM_RINGS];
-	bool work_done;
+	int busy_count = 0, rings_hung = 0;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		seqno[i] = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
-	}
+		u32 seqno;
+		bool idle, err = false;
+
+		seqno = ring->get_seqno(ring, false);
+		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
 
-	/* If all work is done then ACTHD clearly hasn't advanced. */
-	if (idle) {
-		if (err) {
-			if (i915_hangcheck_hung(dev))
-				return;
+		if (idle) {
+			if (err)
+				ring->hangcheck_score++;
+			else
+				ring->hangcheck_score = 0;
+		} else {
+			busy_count++;
 
-			goto repeat;
+			if (ring->hangcheck_seqno == seqno) {
+				ring->hangcheck_score++;
+
+				/* Kick ring */
+				i915_hangcheck_ring_hung(ring);
+			} else {
+				ring->hangcheck_score = 0;
+			}
 		}
 
-		dev_priv->gpu_error.hangcheck_count = 0;
-		return;
+		ring->hangcheck_seqno = seqno;
 	}
 
-	work_done = false;
 	for_each_ring(ring, dev_priv, i) {
-		if (ring->hangcheck_seqno != seqno[i]) {
-			work_done = true;
-			ring->hangcheck_seqno = seqno[i];
+		if (ring->hangcheck_score > 1) {
+			rings_hung++;
+			DRM_ERROR("%s seems hung\n", ring->name);
 		}
 	}
 
-	if (!work_done) {
-		if (i915_hangcheck_hung(dev))
-			return;
-	} else {
-		dev_priv->gpu_error.hangcheck_count = 0;
-	}
+	if (rings_hung)
+		return i915_handle_error(dev, true);
 
-repeat:
-	/* Reset timer case chip hangs without another request being added */
-	mod_timer(&dev_priv->gpu_error.hangcheck_timer,
-		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
+	if (busy_count)
+		/* Reset timer case chip hangs without another request
+		 * being added */
+		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
+			  round_jiffies_up(jiffies +
+					   DRM_I915_HANGCHECK_JIFFIES));
 }
 
 /* drm_dma.h hooks
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 9599c56..97b8f37 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -138,6 +138,7 @@ struct  intel_ring_buffer {
 	struct drm_i915_gem_object *last_context_obj;
 
 	u32 hangcheck_seqno;
+	int hangcheck_score;
 
 	void *private;
 };
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 09/16] drm/i915: remove i915_hangcheck_hung
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (7 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 08/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 10/16] drm/i915: add struct ctx_reset_state Mika Kuoppala
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

Rework of per ring hangcheck made this obsolete.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |    1 -
 drivers/gpu/drm/i915/i915_irq.c |   21 ---------------------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 534e673..a54c507 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -796,7 +796,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
-	int hangcheck_count;
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index dee4557..5e91e94 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1853,27 +1853,6 @@ static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
 	return !kick_ring(ring);
 }
 
-static bool i915_hangcheck_hung(struct drm_device *dev)
-{
-	drm_i915_private_t *dev_priv = dev->dev_private;
-
-	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
-		bool hung = true;
-		struct intel_ring_buffer *ring;
-		int i;
-
-		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
-		i915_handle_error(dev, true);
-
-		for_each_ring(ring, dev_priv, i)
-			hung &= i915_hangcheck_ring_hung(ring);
-
-		return hung;
-	}
-
-	return false;
-}
-
 /**
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. The first time this is called we simply record
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 10/16] drm/i915: add struct ctx_reset_state
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (8 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 09/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15  9:52   ` Chris Wilson
  2013-03-18 20:18   ` Ian Romanick
  2013-03-14 15:52 ` [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts Mika Kuoppala
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

To count context losses, add struct ctx_reset_state for
both i915_hw_context and drm_i915_file_private.
drm_i915_file_private is used when there is no context.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c         |    4 +++-
 drivers/gpu/drm/i915/i915_drv.h         |   19 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c |   11 +++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index e16099b..7902d97 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1792,7 +1792,7 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv;
 
 	DRM_DEBUG_DRIVER("\n");
-	file_priv = kmalloc(sizeof(*file_priv), GFP_KERNEL);
+	file_priv = kzalloc(sizeof(*file_priv), GFP_KERNEL);
 	if (!file_priv)
 		return -ENOMEM;
 
@@ -1801,6 +1801,8 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
 	spin_lock_init(&file_priv->mm.lock);
 	INIT_LIST_HEAD(&file_priv->mm.request_list);
 
+	i915_gem_context_init_reset_state(dev, &file_priv->reset_state);
+
 	idr_init(&file_priv->context_idr);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a54c507..d004548 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -433,6 +433,19 @@ struct i915_hw_ppgtt {
 	void (*cleanup)(struct i915_hw_ppgtt *ppgtt);
 };
 
+struct ctx_reset_state {
+	/* guilty and reset counts when context initialized */
+	unsigned long guilty_cnt;
+	unsigned long reset_cnt;
+
+	unsigned innocent;
+	unsigned guilty;
+
+	unsigned long last_guilty_reset;
+
+	/* banned to submit more work */
+	bool banned;
+};
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
@@ -443,6 +456,7 @@ struct i915_hw_context {
 	struct drm_i915_file_private *file_priv;
 	struct intel_ring_buffer *ring;
 	struct drm_i915_gem_object *obj;
+	struct ctx_reset_state reset_state;
 };
 
 enum no_fbc_reason {
@@ -805,6 +819,7 @@ struct i915_gpu_error {
 
 	unsigned long last_reset;
 
+	unsigned long guilty_cnt;
 	/**
 	 * State variable and reset counter controlling the reset flow
 	 *
@@ -1257,6 +1272,8 @@ struct drm_i915_file_private {
 		struct list_head request_list;
 	} mm;
 	struct idr context_idr;
+
+	struct ctx_reset_state reset_state;
 };
 
 #define INTEL_INFO(dev)	(((struct drm_i915_private *) (dev)->dev_private)->info)
@@ -1677,6 +1694,8 @@ struct i915_hw_context * __must_check
 i915_switch_context(struct intel_ring_buffer *ring,
 		    struct drm_file *file, int to_id);
 void i915_gem_context_free(struct kref *ctx_ref);
+void i915_gem_context_init_reset_state(struct drm_device *dev,
+				       struct ctx_reset_state *rs);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8fb4d3c..dbd14b8 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -145,6 +145,15 @@ static void do_destroy(struct i915_hw_context *ctx)
 	kfree(ctx);
 }
 
+void i915_gem_context_init_reset_state(struct drm_device *dev,
+				       struct ctx_reset_state *rs)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	rs->reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
+	rs->guilty_cnt = dev_priv->gpu_error.guilty_cnt;
+}
+
 static struct i915_hw_context *
 create_hw_context(struct drm_device *dev,
 		  struct drm_i915_file_private *file_priv)
@@ -177,6 +186,8 @@ create_hw_context(struct drm_device *dev,
 
 	ctx->file_priv = file_priv;
 
+	i915_gem_context_init_reset_state(dev, &ctx->reset_state);
+
 again:
 	if (idr_pre_get(&file_priv->context_idr, GFP_KERNEL) == 0) {
 		ret = -ENOMEM;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (9 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 10/16] drm/i915: add struct ctx_reset_state Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15  9:53   ` Chris Wilson
  2013-03-14 15:52 ` [PATCH v2 12/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

For arb-robustness, every context needs to have it's own
reset state tracking. Default context will be handled in a identical
way as the no-context case in further down in the patch set.
For no-context case, the reset state will be stored in
the file_priv part.

v2: handle default context inside get_reset_state

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
 drivers/gpu/drm/i915/i915_gem_context.c |   34 +++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d004548..393f6a2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1696,6 +1696,10 @@ i915_switch_context(struct intel_ring_buffer *ring,
 void i915_gem_context_free(struct kref *ctx_ref);
 void i915_gem_context_init_reset_state(struct drm_device *dev,
 				       struct ctx_reset_state *rs);
+int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
+				     struct drm_file *file,
+				     u32 id,
+				     struct ctx_reset_state **rs);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index dbd14b8..c3faaa3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -315,6 +315,40 @@ static int context_idr_cleanup(int id, void *p, void *data)
 	return 0;
 }
 
+int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
+				     struct drm_file *file,
+				     u32 id,
+				     struct ctx_reset_state **rs)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct i915_hw_context *to;
+
+	if (dev_priv->hw_contexts_disabled)
+		return -ENOENT;
+
+	if (ring->id != RCS)
+		return -EINVAL;
+
+	if (rs == NULL)
+		return -EINVAL;
+
+	if (file == NULL)
+		return -EINVAL;
+
+	if (id == DEFAULT_CONTEXT_ID) {
+		*rs = &file_priv->reset_state;
+	} else {
+		to = i915_gem_context_get(file->driver_priv, id);
+		if (to == NULL)
+			return -ENOENT;
+
+		*rs = &to->reset_state;
+	}
+
+	return 0;
+}
+
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 12/16] drm/i915: add batch object and context to i915_add_request()
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (10 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 13/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

In order to track down a batch buffer and context which
caused the ring to hang, store reference to bo and context
into the request struct. Request can also cause gpu to hang
after the batch in the flush section in the ring. To detect this
add start of the flush portion offset into the request.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |   13 ++++++++++---
 drivers/gpu/drm/i915/i915_gem.c            |    8 ++++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++---
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 393f6a2..1a4cba0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1249,9 +1249,15 @@ struct drm_i915_gem_request {
 	/** GEM sequence number associated with this request. */
 	uint32_t seqno;
 
-	/** Postion in the ringbuffer of the end of the request */
+	/** Position in the ringbuffer of the start of the request */
+	u32 head;
+
+	/** Position in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Batch buffer related to this request if any */
+	struct drm_i915_gem_object *batch_obj;
+
 	/** Context related to this request */
 	struct i915_hw_context *ctx;
 
@@ -1647,9 +1653,10 @@ int __must_check i915_gem_idle(struct drm_device *dev);
 int i915_do_add_request(struct intel_ring_buffer *ring,
 			u32 *seqno,
 			struct drm_file *file,
-			struct i915_hw_context *ctx);
+			struct i915_hw_context *ctx,
+			struct drm_i915_gem_object *batch_obj);
 #define i915_add_request(ring, seqno) \
-	i915_do_add_request(ring, seqno, NULL, NULL)
+	i915_do_add_request(ring, seqno, NULL, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e905086..fa7f446 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1998,14 +1998,16 @@ int
 i915_do_add_request(struct intel_ring_buffer *ring,
 		    u32 *out_seqno,
 		    struct drm_file *file,
-		    struct i915_hw_context *ctx)
+		    struct i915_hw_context *ctx,
+		    struct drm_i915_gem_object *obj)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
-	u32 request_ring_position;
+	u32 request_ring_position, request_start;
 	int was_empty;
 	int ret;
 
+	request_start = intel_ring_get_tail(ring);
 	/*
 	 * Emit any outstanding flushes - execbuf can fail to emit the flush
 	 * after having emitted the batchbuffer command. Hence we need to fix
@@ -2037,7 +2039,9 @@ i915_do_add_request(struct intel_ring_buffer *ring,
 
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
+	request->head = request_start;
 	request->tail = request_ring_position;
+	request->batch_obj = obj;
 	request->ctx = ctx;
 
 	if (request->ctx)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1b524e7..2f23013 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -794,13 +794,14 @@ static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
 				    struct intel_ring_buffer *ring,
-				    struct i915_hw_context *ctx)
+				    struct i915_hw_context *ctx,
+				    struct drm_i915_gem_object *obj)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_do_add_request(ring, NULL, file, ctx);
+	(void)i915_do_add_request(ring, NULL, file, ctx, obj);
 }
 
 static int
@@ -1075,7 +1076,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx, batch_obj);
 
 err:
 	eb_destroy(eb);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 13/16] drm/i915: mark rings which were waiting when hang happened
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (11 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 12/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 14/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

For guilty batchbuffer analysis later on on ring resets,
mark all waiting rings so that we can skip them when
trying to find a true culprit for the gpu hang.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |    3 ++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5e91e94..7d13259 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1889,7 +1889,8 @@ void i915_hangcheck_elapsed(unsigned long data)
 				ring->hangcheck_score++;
 
 				/* Kick ring */
-				i915_hangcheck_ring_hung(ring);
+				ring->hangcheck_was_waiting =
+					!i915_hangcheck_ring_hung(ring);
 			} else {
 				ring->hangcheck_score = 0;
 			}
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 97b8f37..573b0ef 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -139,6 +139,7 @@ struct  intel_ring_buffer {
 
 	u32 hangcheck_seqno;
 	int hangcheck_score;
+	bool hangcheck_was_waiting;
 
 	void *private;
 };
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 14/16] drm/i915: find guilty batch buffer on ring resets
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (12 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 13/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 15/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl Mika Kuoppala
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

After hang check timer has declared gpu to be hang,
rings are reset. In ring reset, when clearing
request list, do post mortem analysis to find out
the guilty batch buffer.

Select requests for further analysis by inspecting
the completed sequence number which has been updated
into the HWS page. If request was completed, it can't
be related to the hang.

For noncompleted requests mark the batch as guilty
if the ring was not waiting and the ring head was
stuck inside the buffer object or in the flush region
right after the batch. For everything else, mark
them as innocents.

v2: Fixed a typo in commit message (Ville Syrjälä)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c |   91 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fa7f446..6d3916a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2099,9 +2099,97 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
 	spin_unlock(&file_priv->mm.lock);
 }
 
+static bool i915_head_inside_object(u32 acthd, struct drm_i915_gem_object *obj)
+{
+	if (acthd >= obj->gtt_offset &&
+	    acthd < obj->gtt_offset + obj->base.size)
+		return true;
+
+	return false;
+}
+
+static bool i915_head_inside_request(u32 acthd, u32 rs, u32 re)
+{
+	if (rs < re) {
+		if (acthd >= rs && acthd < re)
+			return true;
+	} else if (rs > re) {
+		if (acthd >= rs || acthd < re)
+			return true;
+	}
+
+	return false;
+}
+
+static bool i915_request_guilty(struct drm_i915_gem_request *request,
+				const u32 acthd, bool *inside)
+{
+	if (request->batch_obj) {
+		if (i915_head_inside_object(acthd, request->batch_obj)) {
+			*inside = true;
+			return true;
+		}
+	}
+
+	if (i915_head_inside_request(acthd, request->head, request->tail)) {
+		*inside = false;
+		return true;
+	}
+
+	return false;
+}
+
+static void i915_set_reset_status(struct intel_ring_buffer *ring,
+				  struct drm_i915_gem_request *request,
+				  u32 acthd)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct ctx_reset_state *rs = NULL;
+	bool inside, guilty;
+
+	/* Innocent until proven guilty */
+	guilty = false;
+
+	if (!ring->hangcheck_was_waiting &&
+	    i915_request_guilty(request, acthd, &inside)) {
+		DRM_ERROR("%s hung %s bo (0x%x ctx %d) at 0x%x\n",
+			  ring->name,
+			  inside ? "inside" : "flushing",
+			  request->batch_obj ?
+			  request->batch_obj->gtt_offset : 0,
+			  request->ctx ? request->ctx->id : 0,
+			  acthd);
+
+		guilty = true;
+	}
+
+	/* If contexts are disabled or this is the default context, use
+	 * file_priv->reset_state
+	 */
+	if (request->ctx && request->ctx->id != DEFAULT_CONTEXT_ID)
+		rs = &request->ctx->reset_state;
+	else if (request->file_priv)
+		rs = &request->file_priv->reset_state;
+
+	if (rs) {
+		if (guilty) {
+			rs->guilty++;
+			dev_priv->gpu_error.guilty_cnt++;
+		} else {
+			rs->innocent++;
+		}
+	}
+}
+
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
+	u32 completed_seqno;
+	u32 acthd;
+
+	acthd = intel_ring_get_active_head(ring);
+	completed_seqno = ring->get_seqno(ring, false);
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2109,6 +2197,9 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   struct drm_i915_gem_request,
 					   list);
 
+		if (request->seqno > completed_seqno)
+			i915_set_reset_status(ring, request, acthd);
+
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
 
-- 
1.7.9.5

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 15/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (13 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 14/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-14 15:52 ` [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl Mika Kuoppala
  15 siblings, 0 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

If context has recently submitted a faulty batchbuffers guilty of
gpu hang and decides to keep submitting more crap, ban it permanently.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |   23 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h            |    1 +
 drivers/gpu/drm/i915/i915_gem.c            |    1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   11 +++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 1ebed96..69c9856 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -817,6 +817,8 @@ int intel_gpu_reset(struct drm_device *dev)
 int i915_reset(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
+	struct ctx_reset_state *gstate;
+	bool do_wedge = true;
 	int ret;
 
 	if (!i915_try_reset)
@@ -824,10 +826,29 @@ int i915_reset(struct drm_device *dev)
 
 	mutex_lock(&dev->struct_mutex);
 
+	/* i915_gem_reset will set this if it finds guilty context */
+	dev_priv->gpu_error.guilty_state = NULL;
+
 	i915_gem_reset(dev);
 
+	gstate = dev_priv->gpu_error.guilty_state;
+
+	if (gstate) {
+		if (gstate->guilty == 1) {
+			do_wedge = false;
+		} else if (!gstate->banned &&
+			   get_seconds() - gstate->last_guilty_reset < 5) {
+			gstate->banned = true;
+			do_wedge = false;
+		}
+
+		gstate->last_guilty_reset = get_seconds();
+	}
+
+	dev_priv->gpu_error.guilty_state = NULL;
+
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
+	if (do_wedge && get_seconds() - dev_priv->gpu_error.last_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
 	else
 		ret = intel_gpu_reset(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 1a4cba0..3e11acf 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -818,6 +818,7 @@ struct i915_gpu_error {
 	struct work_struct work;
 
 	unsigned long last_reset;
+	struct ctx_reset_state *guilty_state;
 
 	unsigned long guilty_cnt;
 	/**
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6d3916a..de7403f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2175,6 +2175,7 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 		if (guilty) {
 			rs->guilty++;
 			dev_priv->gpu_error.guilty_cnt++;
+			dev_priv->gpu_error.guilty_state = rs;
 		} else {
 			rs->innocent++;
 		}
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 2f23013..97d3887 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -841,6 +841,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
 	struct i915_hw_context *ctx;
+	struct ctx_reset_state *rs;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1023,6 +1024,16 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+	ret = i915_gem_context_get_reset_state(&dev_priv->ring[RCS],
+					       file, ctx_id, &rs);
+	if (ret)
+		goto err;
+
+	if (rs->banned) {
+		ret = -EIO;
+		goto err;
+	}
+
 	ctx = i915_switch_context(ring, file, ctx_id);
 	if (IS_ERR(ctx)) {
 		ret = PTR_ERR(ctx);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
                   ` (14 preceding siblings ...)
  2013-03-14 15:52 ` [PATCH v2 15/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
@ 2013-03-14 15:52 ` Mika Kuoppala
  2013-03-15 10:01   ` Chris Wilson
  2013-03-18 20:26   ` Ian Romanick
  15 siblings, 2 replies; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-14 15:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: miku

This ioctl returns context reset status for specified context.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
CC: idr@freedesktop.org
---
 drivers/gpu/drm/i915/i915_dma.c |    1 +
 drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h |    2 ++
 include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
 4 files changed, 92 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 7902d97..c919832 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
 	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
+	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
 };
 
 int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 69c9856..a4d06f2 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 
 	return 0;
 }
+
+int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
+					    void *data, struct drm_file *file)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_ring_buffer *ring;
+	struct drm_i915_reset_status *args = data;
+	struct ctx_reset_state *rs = NULL;
+	unsigned long reset_cnt;
+	u32 reset_status = I915_RESET_UNKNOWN;
+	int ret;
+
+	ret = mutex_lock_interruptible(&dev->struct_mutex);
+	if (ret)
+		return ret;
+
+	ring = &dev_priv->ring[RCS];
+
+	ret = i915_gem_context_get_reset_state(ring,
+					       file,
+					       args->ctx_id,
+					       &rs);
+	if (ret)
+		goto out;
+
+	BUG_ON(!rs);
+
+	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
+
+	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
+	    reset_cnt == I915_WEDGED) {
+		goto out;
+	}
+
+	/* Set guilty/innocent status if only one reset was
+	 * observed and if only one guilty was found
+	 */
+	if ((rs->reset_cnt + 2) == reset_cnt &&
+	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
+		reset_status = 0;
+
+		if (rs->guilty)
+			reset_status |= I915_RESET_BATCH_ACTIVE;
+
+		if (rs->innocent)
+			reset_status |= I915_RESET_BATCH_PENDING;
+
+		if (reset_status == 0)
+			reset_status = I915_RESET_UNKNOWN;
+	} else if (rs->reset_cnt == reset_cnt) {
+		reset_status = I915_RESET_NO_ERROR;
+	}
+
+out:
+	if (!ret)
+		args->reset_status = reset_status;
+
+	mutex_unlock(&dev->struct_mutex);
+
+	return ret ? -EINVAL : 0;
+}
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3e11acf..2e5e8e7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1712,6 +1712,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 				   struct drm_file *file);
+int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
+					    void *data, struct drm_file *file);
 
 /* i915_gem_gtt.c */
 void i915_gem_cleanup_aliasing_ppgtt(struct drm_device *dev);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 07d5941..a195e0e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -198,6 +198,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_I915_GEM_SET_CACHING	0x2f
 #define DRM_I915_GEM_GET_CACHING	0x30
 #define DRM_I915_REG_READ		0x31
+#define DRM_I915_GET_RESET_STATUS	0x32
 
 #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
 #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
@@ -247,6 +248,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_IOCTL_I915_GEM_CONTEXT_CREATE	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_CREATE, struct drm_i915_gem_context_create)
 #define DRM_IOCTL_I915_GEM_CONTEXT_DESTROY	DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_DESTROY, struct drm_i915_gem_context_destroy)
 #define DRM_IOCTL_I915_REG_READ			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_REG_READ, struct drm_i915_reg_read)
+#define DRM_IOCTL_I915_GET_RESET_STATUS		DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GET_RESET_STATUS, struct drm_i915_reset_status)
 
 /* Allow drivers to submit batchbuffers directly to hardware, relying
  * on the security mechanisms provided by hardware.
@@ -980,4 +982,30 @@ struct drm_i915_reg_read {
 	__u64 offset;
 	__u64 val; /* Return value */
 };
+
+/* No reset observed */
+#define I915_RESET_NO_ERROR      0
+
+/* Context had batch processing active while
+   gpu hung and batch was guilty of gpu hang */
+#define I915_RESET_BATCH_ACTIVE  (1 << 0)
+
+/* Context had batch queued for processing while
+   reset occurred and guilty batch was found:
+   I915_RESET_BATCH_ACTIVE was set for this or
+   some other context */
+#define I915_RESET_BATCH_PENDING (1 << 1)
+
+/* Context observed gpu hung and reset but guilty context
+   was not found: I915_RESET_BATCH_ACTIVE and
+   I915_RESET_BATCH_PENDING were not set for any context */
+#define I915_RESET_UNKNOWN       (1 << 2)
+
+struct drm_i915_reset_status {
+	__u32 ctx_id;
+	__u32 flags;
+	__u32 reset_status;
+	__u32 pad;
+};
+
 #endif /* _UAPI_I915_DRM_H_ */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 02/16] drm/i915: cleanup i915_add_request
  2013-03-14 15:52 ` [PATCH v2 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
@ 2013-03-15  9:43   ` Chris Wilson
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Wilson @ 2013-03-15  9:43 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:03PM +0200, Mika Kuoppala wrote:
> Only execbuffer needs all the parameters. Cleanup everything
> else behind macro.

I'd prefer _i915_add_request over i915_do_add_request.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts
  2013-03-14 15:52 ` [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-03-15  9:44   ` Chris Wilson
  2013-04-02 22:45   ` [PATCH 1/2] [v3] " Ben Widawsky
  1 sibling, 0 replies; 35+ messages in thread
From: Chris Wilson @ 2013-03-15  9:44 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:04PM +0200, Mika Kuoppala wrote:
> @@ -2101,6 +2107,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		list_del(&request->list);
>  		i915_gem_request_remove_from_client(request);
> +
> +		if (request->ctx)
> +			kref_put(&request->ctx->ref, i915_gem_context_free);
> +
>  		kfree(request);

Methinks it is time to kill this code duplication.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 06/16] drm/i915: track ring progression using seqnos
  2013-03-14 15:52 ` [PATCH v2 06/16] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-03-15  9:48   ` Chris Wilson
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Wilson @ 2013-03-15  9:48 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:07PM +0200, Mika Kuoppala wrote:
> index d66208c..9599c56 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -137,6 +137,8 @@ struct  intel_ring_buffer {
>  	struct i915_hw_context *default_context;
>  	struct drm_i915_gem_object *last_context_obj;
>  
> +	u32 hangcheck_seqno;

We end up with quite a few hangheck_* variables. An indication that we
should sort those out into their own struct.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 10/16] drm/i915: add struct ctx_reset_state
  2013-03-14 15:52 ` [PATCH v2 10/16] drm/i915: add struct ctx_reset_state Mika Kuoppala
@ 2013-03-15  9:52   ` Chris Wilson
  2013-03-17 21:51     ` Daniel Vetter
  2013-03-18 20:18   ` Ian Romanick
  1 sibling, 1 reply; 35+ messages in thread
From: Chris Wilson @ 2013-03-15  9:52 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:11PM +0200, Mika Kuoppala wrote:
> To count context losses, add struct ctx_reset_state for
> both i915_hw_context and drm_i915_file_private.
> drm_i915_file_private is used when there is no context.

Being really picky, but can we device a better name than reset_state. I
keep reading 'reset' as a verb and get very confused...

Even just gpu_reset reads better. Suggestions?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts
  2013-03-14 15:52 ` [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts Mika Kuoppala
@ 2013-03-15  9:53   ` Chris Wilson
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Wilson @ 2013-03-15  9:53 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:12PM +0200, Mika Kuoppala wrote:
> For arb-robustness, every context needs to have it's own
> reset state tracking. Default context will be handled in a identical
> way as the no-context case in further down in the patch set.
> For no-context case, the reset state will be stored in
> the file_priv part.
> 
> v2: handle default context inside get_reset_state
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
>  drivers/gpu/drm/i915/i915_gem_context.c |   34 +++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index d004548..393f6a2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1696,6 +1696,10 @@ i915_switch_context(struct intel_ring_buffer *ring,
>  void i915_gem_context_free(struct kref *ctx_ref);
>  void i915_gem_context_init_reset_state(struct drm_device *dev,
>  				       struct ctx_reset_state *rs);
> +int i915_gem_context_get_reset_state(struct intel_ring_buffer *ring,
> +				     struct drm_file *file,
> +				     u32 id,
> +				     struct ctx_reset_state **rs);

Return struct ctx_reset_state, and use ERR_PTR. Also don't immediately
discard the return code in the (eventual) caller!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-14 15:52 ` [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl Mika Kuoppala
@ 2013-03-15 10:01   ` Chris Wilson
  2013-03-18 20:26   ` Ian Romanick
  1 sibling, 0 replies; 35+ messages in thread
From: Chris Wilson @ 2013-03-15 10:01 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On Thu, Mar 14, 2013 at 05:52:17PM +0200, Mika Kuoppala wrote:
> This ioctl returns context reset status for specified context.
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> CC: idr@freedesktop.org
> ---
>  drivers/gpu/drm/i915/i915_dma.c |    1 +
>  drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.h |    2 ++
>  include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
>  4 files changed, 92 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 7902d97..c919832 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
>  	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
>  	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
>  	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
> +	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
>  };
>  
>  int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 69c9856..a4d06f2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>  
>  	return 0;
>  }
> +
> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
> +					    void *data, struct drm_file *file)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_ring_buffer *ring;
> +	struct drm_i915_reset_status *args = data;
> +	struct ctx_reset_state *rs = NULL;
> +	unsigned long reset_cnt;
> +	u32 reset_status = I915_RESET_UNKNOWN;
> +	int ret;
> +
> +	ret = mutex_lock_interruptible(&dev->struct_mutex);
> +	if (ret)
> +		return ret;
> +
> +	ring = &dev_priv->ring[RCS];
> +
> +	ret = i915_gem_context_get_reset_state(ring,
> +					       file,
> +					       args->ctx_id,
> +					       &rs);
> +	if (ret)
> +		goto out;
See earlier comments.

> +	BUG_ON(!rs);
> +
> +	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
> +
> +	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
> +	    reset_cnt == I915_WEDGED) {
> +		goto out;
I915_WEDGED & I915_RESET_IN_PROGRESS_FLAGS is defined as true.

> +	}
> +
> +	/* Set guilty/innocent status if only one reset was
> +	 * observed and if only one guilty was found
> +	 */
> +	if ((rs->reset_cnt + 2) == reset_cnt &&
> +	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
> +		reset_status = 0;
> +
> +		if (rs->guilty)
> +			reset_status |= I915_RESET_BATCH_ACTIVE;
> +
> +		if (rs->innocent)
> +			reset_status |= I915_RESET_BATCH_PENDING;
> +
> +		if (reset_status == 0)
> +			reset_status = I915_RESET_UNKNOWN;
> +	} else if (rs->reset_cnt == reset_cnt) {
> +		reset_status = I915_RESET_NO_ERROR;
> +	}
This looks very fragile and time dependent. It is not an interface I can
use...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 10/16] drm/i915: add struct ctx_reset_state
  2013-03-15  9:52   ` Chris Wilson
@ 2013-03-17 21:51     ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2013-03-17 21:51 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx, miku

On Fri, Mar 15, 2013 at 09:52:43AM +0000, Chris Wilson wrote:
> On Thu, Mar 14, 2013 at 05:52:11PM +0200, Mika Kuoppala wrote:
> > To count context losses, add struct ctx_reset_state for
> > both i915_hw_context and drm_i915_file_private.
> > drm_i915_file_private is used when there is no context.
> 
> Being really picky, but can we device a better name than reset_state. I
> keep reading 'reset' as a verb and get very confused...
> 
> Even just gpu_reset reads better. Suggestions?

hang_stats? failure_stats? In any case the struct definition itself needs
a i915_ prefix.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively
  2013-03-14 15:52 ` [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively Mika Kuoppala
@ 2013-03-17 21:53   ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2013-03-17 21:53 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx, miku, Ben Widawsky

On Thu, Mar 14, 2013 at 05:52:05PM +0200, Mika Kuoppala wrote:
> From: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Once we thought we got semaphores working, we disabled kicking the ring
> if hangcheck fired whilst waiting upon a ring as it was doing more harm
> than good:
> 
> commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Dec 14 13:56:58 2011 +0100
> 
>     drm/i915: kicking rings stuck on semaphores considered harmful
> 
> However, life is never that easy and semaphores are still causing
> problems whereby the value written by one ring (bcs) is not being
> propagated to the waiter (rcs). Thus the waiter never wakes up and we
> declare the GPU hung, which often has unfortunate consequences, even if
> we successfully reset the GPU.
> 
> But the GPU is idle as it has completed the work, just didn't notify its
> clients. So we can detect the incomplete wait during hang check and
> probe the target ring to see if has indeed emitted the breadcrumb seqno
> following the work and then and only then kick the waiter.
> 
> Based on a suggestion by Ben Widawsky.
> 
> v2: cross-check wait with iphdr. fix signaller calculation.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=54226
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Acked-by: Ben Widawsky <ben@bwidawsk.net>

I'll throw in the towel, let's all hail duct-tape. Queued for -next,
thanks for the patch.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 10/16] drm/i915: add struct ctx_reset_state
  2013-03-14 15:52 ` [PATCH v2 10/16] drm/i915: add struct ctx_reset_state Mika Kuoppala
  2013-03-15  9:52   ` Chris Wilson
@ 2013-03-18 20:18   ` Ian Romanick
  1 sibling, 0 replies; 35+ messages in thread
From: Ian Romanick @ 2013-03-18 20:18 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
> To count context losses, add struct ctx_reset_state for
> both i915_hw_context and drm_i915_file_private.
> drm_i915_file_private is used when there is no context.
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_dma.c         |    4 +++-
>   drivers/gpu/drm/i915/i915_drv.h         |   19 +++++++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.c |   11 +++++++++++
>   3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index e16099b..7902d97 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1792,7 +1792,7 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
>   	struct drm_i915_file_private *file_priv;
>
>   	DRM_DEBUG_DRIVER("\n");
> -	file_priv = kmalloc(sizeof(*file_priv), GFP_KERNEL);
> +	file_priv = kzalloc(sizeof(*file_priv), GFP_KERNEL);
>   	if (!file_priv)
>   		return -ENOMEM;
>
> @@ -1801,6 +1801,8 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
>   	spin_lock_init(&file_priv->mm.lock);
>   	INIT_LIST_HEAD(&file_priv->mm.request_list);
>
> +	i915_gem_context_init_reset_state(dev, &file_priv->reset_state);
> +
>   	idr_init(&file_priv->context_idr);
>
>   	return 0;
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a54c507..d004548 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -433,6 +433,19 @@ struct i915_hw_ppgtt {
>   	void (*cleanup)(struct i915_hw_ppgtt *ppgtt);
>   };
>
> +struct ctx_reset_state {
> +	/* guilty and reset counts when context initialized */
> +	unsigned long guilty_cnt;
> +	unsigned long reset_cnt;

I think we can afford to spell out "count."  The first time I saw cnt, 
it looked like a dirty word. :)

I think this structure could you some better description of the overall 
architecture.  It's not completely obvious from the individual pieces... 
and that makes it really hard to evaluate.

reset_cnt is the number of resets since start-up.  What is guilty_cnt? 
What are innocent and guilty (below)?

All of this makes it difficult for me to tell whether or not the logic 
in patch 16 is correct... and I don't think it is.

> +
> +	unsigned innocent;
> +	unsigned guilty;
> +

         /* Time when this context was last blamed for a GPU reset.
          */
> +	unsigned long last_guilty_reset;
> +
> +	/* banned to submit more work */
> +	bool banned;
> +};
>
>   /* This must match up with the value previously used for execbuf2.rsvd1. */
>   #define DEFAULT_CONTEXT_ID 0
> @@ -443,6 +456,7 @@ struct i915_hw_context {
>   	struct drm_i915_file_private *file_priv;
>   	struct intel_ring_buffer *ring;
>   	struct drm_i915_gem_object *obj;
> +	struct ctx_reset_state reset_state;
>   };
>
>   enum no_fbc_reason {
> @@ -805,6 +819,7 @@ struct i915_gpu_error {
>
>   	unsigned long last_reset;
>
> +	unsigned long guilty_cnt;
>   	/**
>   	 * State variable and reset counter controlling the reset flow
>   	 *
> @@ -1257,6 +1272,8 @@ struct drm_i915_file_private {
>   		struct list_head request_list;
>   	} mm;
>   	struct idr context_idr;
> +
> +	struct ctx_reset_state reset_state;
>   };
>
>   #define INTEL_INFO(dev)	(((struct drm_i915_private *) (dev)->dev_private)->info)
> @@ -1677,6 +1694,8 @@ struct i915_hw_context * __must_check
>   i915_switch_context(struct intel_ring_buffer *ring,
>   		    struct drm_file *file, int to_id);
>   void i915_gem_context_free(struct kref *ctx_ref);
> +void i915_gem_context_init_reset_state(struct drm_device *dev,
> +				       struct ctx_reset_state *rs);
>   int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file);
>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8fb4d3c..dbd14b8 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -145,6 +145,15 @@ static void do_destroy(struct i915_hw_context *ctx)
>   	kfree(ctx);
>   }
>
> +void i915_gem_context_init_reset_state(struct drm_device *dev,
> +				       struct ctx_reset_state *rs)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +
> +	rs->reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
> +	rs->guilty_cnt = dev_priv->gpu_error.guilty_cnt;
> +}
> +
>   static struct i915_hw_context *
>   create_hw_context(struct drm_device *dev,
>   		  struct drm_i915_file_private *file_priv)
> @@ -177,6 +186,8 @@ create_hw_context(struct drm_device *dev,
>
>   	ctx->file_priv = file_priv;
>
> +	i915_gem_context_init_reset_state(dev, &ctx->reset_state);
> +
>   again:
>   	if (idr_pre_get(&file_priv->context_idr, GFP_KERNEL) == 0) {
>   		ret = -ENOMEM;
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-14 15:52 ` [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl Mika Kuoppala
  2013-03-15 10:01   ` Chris Wilson
@ 2013-03-18 20:26   ` Ian Romanick
  2013-03-19 12:58     ` Mika Kuoppala
  1 sibling, 1 reply; 35+ messages in thread
From: Ian Romanick @ 2013-03-18 20:26 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
> This ioctl returns context reset status for specified context.
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> CC: idr@freedesktop.org
> ---
>   drivers/gpu/drm/i915/i915_dma.c |    1 +
>   drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
>   drivers/gpu/drm/i915/i915_drv.h |    2 ++
>   include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
>   4 files changed, 92 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 7902d97..c919832 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
>   	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
> +	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
>   };
>
>   int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 69c9856..a4d06f2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>
>   	return 0;
>   }
> +
> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
> +					    void *data, struct drm_file *file)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_ring_buffer *ring;
> +	struct drm_i915_reset_status *args = data;
> +	struct ctx_reset_state *rs = NULL;
> +	unsigned long reset_cnt;
> +	u32 reset_status = I915_RESET_UNKNOWN;
> +	int ret;
> +
> +	ret = mutex_lock_interruptible(&dev->struct_mutex);
> +	if (ret)
> +		return ret;
> +
> +	ring = &dev_priv->ring[RCS];
> +
> +	ret = i915_gem_context_get_reset_state(ring,
> +					       file,
> +					       args->ctx_id,
> +					       &rs);
> +	if (ret)
> +		goto out;
> +
> +	BUG_ON(!rs);
> +
> +	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
> +
> +	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||

In this case, I believe we're supposed to return the reset state to the 
application.  The ARB_robustness spec says:

     "If a reset status other than NO_ERROR is returned and subsequent
     calls return NO_ERROR, the context reset was encountered and
     completed. If a reset status is repeatedly returned, the context may
     be in the process of resetting."

If the reset takes a long time, it seems that even a well-behaved app 
could run afoul of the 'banned' logic.

> +	    reset_cnt == I915_WEDGED) {
> +		goto out;
> +	}
> +
> +	/* Set guilty/innocent status if only one reset was
> +	 * observed and if only one guilty was found
> +	 */
> +	if ((rs->reset_cnt + 2) == reset_cnt &&
> +	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {

This logic seems... wrong, or at least weird.  "rs->reset_cnt + 2" is 
confusing next to "if only one reset was observed".

dev_priv->gpu_error.reset_counter is the global GPU reset count since 
start-up, and rs->reset_cnt is the global GPU count since start-up when 
the context was created.  Right?

If that's the case, this will cause a context that was completely idle 
(i.e., didn't actually lose anything) to get a reset notification. 
That's an absolute deal breaker.

If that's not the case, then this architecture needs a lot more 
documentation so that people new to it can understand what's happening.

> +		reset_status = 0;
> +
> +		if (rs->guilty)
> +			reset_status |= I915_RESET_BATCH_ACTIVE;
> +
> +		if (rs->innocent)
> +			reset_status |= I915_RESET_BATCH_PENDING;
> +
> +		if (reset_status == 0)
> +			reset_status = I915_RESET_UNKNOWN;
> +	} else if (rs->reset_cnt == reset_cnt) {
> +		reset_status = I915_RESET_NO_ERROR;
> +	}
> +
> +out:
> +	if (!ret)
> +		args->reset_status = reset_status;
> +
> +	mutex_unlock(&dev->struct_mutex);
> +
> +	return ret ? -EINVAL : 0;
> +}
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 3e11acf..2e5e8e7 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1712,6 +1712,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>   				  struct drm_file *file);
>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   				   struct drm_file *file);
> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
> +					    void *data, struct drm_file *file);
>
>   /* i915_gem_gtt.c */
>   void i915_gem_cleanup_aliasing_ppgtt(struct drm_device *dev);
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 07d5941..a195e0e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -198,6 +198,7 @@ typedef struct _drm_i915_sarea {
>   #define DRM_I915_GEM_SET_CACHING	0x2f
>   #define DRM_I915_GEM_GET_CACHING	0x30
>   #define DRM_I915_REG_READ		0x31
> +#define DRM_I915_GET_RESET_STATUS	0x32
>
>   #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
>   #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
> @@ -247,6 +248,7 @@ typedef struct _drm_i915_sarea {
>   #define DRM_IOCTL_I915_GEM_CONTEXT_CREATE	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_CREATE, struct drm_i915_gem_context_create)
>   #define DRM_IOCTL_I915_GEM_CONTEXT_DESTROY	DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_DESTROY, struct drm_i915_gem_context_destroy)
>   #define DRM_IOCTL_I915_REG_READ			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_REG_READ, struct drm_i915_reg_read)
> +#define DRM_IOCTL_I915_GET_RESET_STATUS		DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GET_RESET_STATUS, struct drm_i915_reset_status)
>
>   /* Allow drivers to submit batchbuffers directly to hardware, relying
>    * on the security mechanisms provided by hardware.
> @@ -980,4 +982,30 @@ struct drm_i915_reg_read {
>   	__u64 offset;
>   	__u64 val; /* Return value */
>   };
> +
> +/* No reset observed */
> +#define I915_RESET_NO_ERROR      0
> +
> +/* Context had batch processing active while
> +   gpu hung and batch was guilty of gpu hang */
> +#define I915_RESET_BATCH_ACTIVE  (1 << 0)
> +
> +/* Context had batch queued for processing while
> +   reset occurred and guilty batch was found:
> +   I915_RESET_BATCH_ACTIVE was set for this or
> +   some other context */
> +#define I915_RESET_BATCH_PENDING (1 << 1)
> +
> +/* Context observed gpu hung and reset but guilty context
> +   was not found: I915_RESET_BATCH_ACTIVE and
> +   I915_RESET_BATCH_PENDING were not set for any context */
> +#define I915_RESET_UNKNOWN       (1 << 2)
> +
> +struct drm_i915_reset_status {
> +	__u32 ctx_id;
> +	__u32 flags;
> +	__u32 reset_status;
> +	__u32 pad;
> +};
> +
>   #endif /* _UAPI_I915_DRM_H_ */
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-18 20:26   ` Ian Romanick
@ 2013-03-19 12:58     ` Mika Kuoppala
  2013-03-19 19:02       ` Ian Romanick
  0 siblings, 1 reply; 35+ messages in thread
From: Mika Kuoppala @ 2013-03-19 12:58 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx, miku

Ian Romanick <idr@freedesktop.org> writes:

> On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
>> This ioctl returns context reset status for specified context.
>>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> CC: idr@freedesktop.org
>> ---
>>   drivers/gpu/drm/i915/i915_dma.c |    1 +
>>   drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/i915/i915_drv.h |    2 ++
>>   include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
>>   4 files changed, 92 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
>> index 7902d97..c919832 100644
>> --- a/drivers/gpu/drm/i915/i915_dma.c
>> +++ b/drivers/gpu/drm/i915/i915_dma.c
>> @@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
>>   	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
>> +	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
>>   };
>>
>>   int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 69c9856..a4d06f2 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>
>>   	return 0;
>>   }
>> +
>> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
>> +					    void *data, struct drm_file *file)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct intel_ring_buffer *ring;
>> +	struct drm_i915_reset_status *args = data;
>> +	struct ctx_reset_state *rs = NULL;
>> +	unsigned long reset_cnt;
>> +	u32 reset_status = I915_RESET_UNKNOWN;
>> +	int ret;
>> +
>> +	ret = mutex_lock_interruptible(&dev->struct_mutex);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ring = &dev_priv->ring[RCS];
>> +
>> +	ret = i915_gem_context_get_reset_state(ring,
>> +					       file,
>> +					       args->ctx_id,
>> +					       &rs);
>> +	if (ret)
>> +		goto out;
>> +
>> +	BUG_ON(!rs);
>> +
>> +	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
>> +
>> +	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
>
> In this case, I believe we're supposed to return the reset state to the 
> application.  The ARB_robustness spec says:
>
>      "If a reset status other than NO_ERROR is returned and subsequent
>      calls return NO_ERROR, the context reset was encountered and
>      completed. If a reset status is repeatedly returned, the context may
>      be in the process of resetting."
>
> If the reset takes a long time, it seems that even a well-behaved app 
> could run afoul of the 'banned' logic.

As there reset status is initialized to I915_RESET_UNKNOWN,
we return it if the reset is in progress or gpu is wedged.

>> +	    reset_cnt == I915_WEDGED) {
>> +		goto out;
>> +	}
>> +
>> +	/* Set guilty/innocent status if only one reset was
>> +	 * observed and if only one guilty was found
>> +	 */
>> +	if ((rs->reset_cnt + 2) == reset_cnt &&
>> +	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
>
> This logic seems... wrong, or at least weird.  "rs->reset_cnt + 2" is 
> confusing next to "if only one reset was observed".
>
> dev_priv->gpu_error.reset_counter is the global GPU reset count since 
> start-up, and rs->reset_cnt is the global GPU count since start-up when 
> the context was created.  Right?

Right. The confusing part in here is the
dev_priv->gpu_error.reset_counter. If it is odd, reset is in progress,
if it is even, the reset has been handled and all is well. That is why +2

> If that's the case, this will cause a context that was completely idle 
> (i.e., didn't actually lose anything) to get a reset notification. 
> That's an absolute deal breaker.

This was then misunderstood by me. I will make it so that if you have
no batches submitted, you wont observe a reset.

> If that's not the case, then this architecture needs a lot more 
> documentation so that people new to it can understand what's happening.

Agreed. If we don't need to care about the contexts where there
were no batches submitted, the logic will be simpler tho.

-Mika

>> +		reset_status = 0;
>> +
>> +		if (rs->guilty)
>> +			reset_status |= I915_RESET_BATCH_ACTIVE;
>> +
>> +		if (rs->innocent)
>> +			reset_status |= I915_RESET_BATCH_PENDING;
>> +
>> +		if (reset_status == 0)
>> +			reset_status = I915_RESET_UNKNOWN;
>> +	} else if (rs->reset_cnt == reset_cnt) {
>> +		reset_status = I915_RESET_NO_ERROR;
>> +	}
>> +
>> +out:
>> +	if (!ret)
>> +		args->reset_status = reset_status;
>> +
>> +	mutex_unlock(&dev->struct_mutex);
>> +
>> +	return ret ? -EINVAL : 0;
>> +}
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index 3e11acf..2e5e8e7 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1712,6 +1712,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>   				  struct drm_file *file);
>>   int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>   				   struct drm_file *file);
>> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
>> +					    void *data, struct drm_file *file);
>>
>>   /* i915_gem_gtt.c */
>>   void i915_gem_cleanup_aliasing_ppgtt(struct drm_device *dev);
>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>> index 07d5941..a195e0e 100644
>> --- a/include/uapi/drm/i915_drm.h
>> +++ b/include/uapi/drm/i915_drm.h
>> @@ -198,6 +198,7 @@ typedef struct _drm_i915_sarea {
>>   #define DRM_I915_GEM_SET_CACHING	0x2f
>>   #define DRM_I915_GEM_GET_CACHING	0x30
>>   #define DRM_I915_REG_READ		0x31
>> +#define DRM_I915_GET_RESET_STATUS	0x32
>>
>>   #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
>>   #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
>> @@ -247,6 +248,7 @@ typedef struct _drm_i915_sarea {
>>   #define DRM_IOCTL_I915_GEM_CONTEXT_CREATE	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_CREATE, struct drm_i915_gem_context_create)
>>   #define DRM_IOCTL_I915_GEM_CONTEXT_DESTROY	DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_DESTROY, struct drm_i915_gem_context_destroy)
>>   #define DRM_IOCTL_I915_REG_READ			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_REG_READ, struct drm_i915_reg_read)
>> +#define DRM_IOCTL_I915_GET_RESET_STATUS		DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GET_RESET_STATUS, struct drm_i915_reset_status)
>>
>>   /* Allow drivers to submit batchbuffers directly to hardware, relying
>>    * on the security mechanisms provided by hardware.
>> @@ -980,4 +982,30 @@ struct drm_i915_reg_read {
>>   	__u64 offset;
>>   	__u64 val; /* Return value */
>>   };
>> +
>> +/* No reset observed */
>> +#define I915_RESET_NO_ERROR      0
>> +
>> +/* Context had batch processing active while
>> +   gpu hung and batch was guilty of gpu hang */
>> +#define I915_RESET_BATCH_ACTIVE  (1 << 0)
>> +
>> +/* Context had batch queued for processing while
>> +   reset occurred and guilty batch was found:
>> +   I915_RESET_BATCH_ACTIVE was set for this or
>> +   some other context */
>> +#define I915_RESET_BATCH_PENDING (1 << 1)
>> +
>> +/* Context observed gpu hung and reset but guilty context
>> +   was not found: I915_RESET_BATCH_ACTIVE and
>> +   I915_RESET_BATCH_PENDING were not set for any context */
>> +#define I915_RESET_UNKNOWN       (1 << 2)
>> +
>> +struct drm_i915_reset_status {
>> +	__u32 ctx_id;
>> +	__u32 flags;
>> +	__u32 reset_status;
>> +	__u32 pad;
>> +};
>> +
>>   #endif /* _UAPI_I915_DRM_H_ */
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-19 12:58     ` Mika Kuoppala
@ 2013-03-19 19:02       ` Ian Romanick
  2013-03-19 19:21         ` Daniel Vetter
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Romanick @ 2013-03-19 19:02 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, miku

On 03/19/2013 05:58 AM, Mika Kuoppala wrote:
> Ian Romanick <idr@freedesktop.org> writes:
>
>> On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
>>> This ioctl returns context reset status for specified context.
>>>
>>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>>> CC: idr@freedesktop.org
>>> ---
>>>    drivers/gpu/drm/i915/i915_dma.c |    1 +
>>>    drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
>>>    drivers/gpu/drm/i915/i915_drv.h |    2 ++
>>>    include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
>>>    4 files changed, 92 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
>>> index 7902d97..c919832 100644
>>> --- a/drivers/gpu/drm/i915/i915_dma.c
>>> +++ b/drivers/gpu/drm/i915/i915_dma.c
>>> @@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
>>>    	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
>>>    	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
>>>    	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
>>> +	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
>>>    };
>>>
>>>    int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>>> index 69c9856..a4d06f2 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.c
>>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>>> @@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>>
>>>    	return 0;
>>>    }
>>> +
>>> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
>>> +					    void *data, struct drm_file *file)
>>> +{
>>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>>> +	struct intel_ring_buffer *ring;
>>> +	struct drm_i915_reset_status *args = data;
>>> +	struct ctx_reset_state *rs = NULL;
>>> +	unsigned long reset_cnt;
>>> +	u32 reset_status = I915_RESET_UNKNOWN;
>>> +	int ret;
>>> +
>>> +	ret = mutex_lock_interruptible(&dev->struct_mutex);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ring = &dev_priv->ring[RCS];
>>> +
>>> +	ret = i915_gem_context_get_reset_state(ring,
>>> +					       file,
>>> +					       args->ctx_id,
>>> +					       &rs);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	BUG_ON(!rs);
>>> +
>>> +	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
>>> +
>>> +	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
>>
>> In this case, I believe we're supposed to return the reset state to the
>> application.  The ARB_robustness spec says:
>>
>>       "If a reset status other than NO_ERROR is returned and subsequent
>>       calls return NO_ERROR, the context reset was encountered and
>>       completed. If a reset status is repeatedly returned, the context may
>>       be in the process of resetting."
>>
>> If the reset takes a long time, it seems that even a well-behaved app
>> could run afoul of the 'banned' logic.
>
> As there reset status is initialized to I915_RESET_UNKNOWN,
> we return it if the reset is in progress or gpu is wedged.

Hmm... so user space will see I915_RESET_UNKNOWN until the reset is 
done, then it will (usually) see either I915_RESET_BATCH_ACTIVE or 
I915_RESET_BATCH_PENDING.  I think that should be okay.

>>> +	    reset_cnt == I915_WEDGED) {
>>> +		goto out;
>>> +	}
>>> +
>>> +	/* Set guilty/innocent status if only one reset was
>>> +	 * observed and if only one guilty was found
>>> +	 */
>>> +	if ((rs->reset_cnt + 2) == reset_cnt &&
>>> +	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
>>
>> This logic seems... wrong, or at least weird.  "rs->reset_cnt + 2" is
>> confusing next to "if only one reset was observed".
>>
>> dev_priv->gpu_error.reset_counter is the global GPU reset count since
>> start-up, and rs->reset_cnt is the global GPU count since start-up when
>> the context was created.  Right?
>
> Right. The confusing part in here is the
> dev_priv->gpu_error.reset_counter. If it is odd, reset is in progress,
> if it is even, the reset has been handled and all is well. That is why +2

That's a clever hack, I'm assuming, to use atomic operations instead of 
locks.   Dear God that's awful to understand... it's a tiny bit more 
clear looking back at the 'reset_cnt & I915_RESET_IN_PROGRESS_FLAG'. 
Perhaps we could get some wrapper macros RESET_IN_PROGRESS() and 
RESET_ACTUAL_COUNT() or something?

>> If that's the case, this will cause a context that was completely idle
>> (i.e., didn't actually lose anything) to get a reset notification.
>> That's an absolute deal breaker.
>
> This was then misunderstood by me. I will make it so that if you have
> no batches submitted, you wont observe a reset.
>
>> If that's not the case, then this architecture needs a lot more
>> documentation so that people new to it can understand what's happening.
>
> Agreed. If we don't need to care about the contexts where there
> were no batches submitted, the logic will be simpler tho.
>
> -Mika
>
>>> +		reset_status = 0;
>>> +
>>> +		if (rs->guilty)
>>> +			reset_status |= I915_RESET_BATCH_ACTIVE;
>>> +
>>> +		if (rs->innocent)
>>> +			reset_status |= I915_RESET_BATCH_PENDING;
>>> +
>>> +		if (reset_status == 0)
>>> +			reset_status = I915_RESET_UNKNOWN;
>>> +	} else if (rs->reset_cnt == reset_cnt) {
>>> +		reset_status = I915_RESET_NO_ERROR;
>>> +	}
>>> +
>>> +out:
>>> +	if (!ret)
>>> +		args->reset_status = reset_status;
>>> +
>>> +	mutex_unlock(&dev->struct_mutex);
>>> +
>>> +	return ret ? -EINVAL : 0;
>>> +}
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>> index 3e11acf..2e5e8e7 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -1712,6 +1712,8 @@ int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>>>    				  struct drm_file *file);
>>>    int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>>    				   struct drm_file *file);
>>> +int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
>>> +					    void *data, struct drm_file *file);
>>>
>>>    /* i915_gem_gtt.c */
>>>    void i915_gem_cleanup_aliasing_ppgtt(struct drm_device *dev);
>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>> index 07d5941..a195e0e 100644
>>> --- a/include/uapi/drm/i915_drm.h
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -198,6 +198,7 @@ typedef struct _drm_i915_sarea {
>>>    #define DRM_I915_GEM_SET_CACHING	0x2f
>>>    #define DRM_I915_GEM_GET_CACHING	0x30
>>>    #define DRM_I915_REG_READ		0x31
>>> +#define DRM_I915_GET_RESET_STATUS	0x32
>>>
>>>    #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
>>>    #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
>>> @@ -247,6 +248,7 @@ typedef struct _drm_i915_sarea {
>>>    #define DRM_IOCTL_I915_GEM_CONTEXT_CREATE	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_CREATE, struct drm_i915_gem_context_create)
>>>    #define DRM_IOCTL_I915_GEM_CONTEXT_DESTROY	DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_DESTROY, struct drm_i915_gem_context_destroy)
>>>    #define DRM_IOCTL_I915_REG_READ			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_REG_READ, struct drm_i915_reg_read)
>>> +#define DRM_IOCTL_I915_GET_RESET_STATUS		DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GET_RESET_STATUS, struct drm_i915_reset_status)
>>>
>>>    /* Allow drivers to submit batchbuffers directly to hardware, relying
>>>     * on the security mechanisms provided by hardware.
>>> @@ -980,4 +982,30 @@ struct drm_i915_reg_read {
>>>    	__u64 offset;
>>>    	__u64 val; /* Return value */
>>>    };
>>> +
>>> +/* No reset observed */
>>> +#define I915_RESET_NO_ERROR      0
>>> +
>>> +/* Context had batch processing active while
>>> +   gpu hung and batch was guilty of gpu hang */
>>> +#define I915_RESET_BATCH_ACTIVE  (1 << 0)
>>> +
>>> +/* Context had batch queued for processing while
>>> +   reset occurred and guilty batch was found:
>>> +   I915_RESET_BATCH_ACTIVE was set for this or
>>> +   some other context */
>>> +#define I915_RESET_BATCH_PENDING (1 << 1)
>>> +
>>> +/* Context observed gpu hung and reset but guilty context
>>> +   was not found: I915_RESET_BATCH_ACTIVE and
>>> +   I915_RESET_BATCH_PENDING were not set for any context */
>>> +#define I915_RESET_UNKNOWN       (1 << 2)
>>> +
>>> +struct drm_i915_reset_status {
>>> +	__u32 ctx_id;
>>> +	__u32 flags;
>>> +	__u32 reset_status;
>>> +	__u32 pad;
>>> +};
>>> +
>>>    #endif /* _UAPI_I915_DRM_H_ */
>>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl
  2013-03-19 19:02       ` Ian Romanick
@ 2013-03-19 19:21         ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2013-03-19 19:21 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx, miku

On Tue, Mar 19, 2013 at 12:02:48PM -0700, Ian Romanick wrote:
> On 03/19/2013 05:58 AM, Mika Kuoppala wrote:
> >Ian Romanick <idr@freedesktop.org> writes:
> >
> >>On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
> >>>This ioctl returns context reset status for specified context.
> >>>
> >>>Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >>>CC: idr@freedesktop.org
> >>>---
> >>>   drivers/gpu/drm/i915/i915_dma.c |    1 +
> >>>   drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
> >>>   drivers/gpu/drm/i915/i915_drv.h |    2 ++
> >>>   include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
> >>>   4 files changed, 92 insertions(+)
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> >>>index 7902d97..c919832 100644
> >>>--- a/drivers/gpu/drm/i915/i915_dma.c
> >>>+++ b/drivers/gpu/drm/i915/i915_dma.c
> >>>@@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
> >>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
> >>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
> >>>   	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
> >>>+	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
> >>>   };
> >>>
> >>>   int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
> >>>diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >>>index 69c9856..a4d06f2 100644
> >>>--- a/drivers/gpu/drm/i915/i915_drv.c
> >>>+++ b/drivers/gpu/drm/i915/i915_drv.c
> >>>@@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
> >>>
> >>>   	return 0;
> >>>   }
> >>>+
> >>>+int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
> >>>+					    void *data, struct drm_file *file)
> >>>+{
> >>>+	struct drm_i915_private *dev_priv = dev->dev_private;
> >>>+	struct intel_ring_buffer *ring;
> >>>+	struct drm_i915_reset_status *args = data;
> >>>+	struct ctx_reset_state *rs = NULL;
> >>>+	unsigned long reset_cnt;
> >>>+	u32 reset_status = I915_RESET_UNKNOWN;
> >>>+	int ret;
> >>>+
> >>>+	ret = mutex_lock_interruptible(&dev->struct_mutex);
> >>>+	if (ret)
> >>>+		return ret;
> >>>+
> >>>+	ring = &dev_priv->ring[RCS];
> >>>+
> >>>+	ret = i915_gem_context_get_reset_state(ring,
> >>>+					       file,
> >>>+					       args->ctx_id,
> >>>+					       &rs);
> >>>+	if (ret)
> >>>+		goto out;
> >>>+
> >>>+	BUG_ON(!rs);
> >>>+
> >>>+	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
> >>>+
> >>>+	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
> >>
> >>In this case, I believe we're supposed to return the reset state to the
> >>application.  The ARB_robustness spec says:
> >>
> >>      "If a reset status other than NO_ERROR is returned and subsequent
> >>      calls return NO_ERROR, the context reset was encountered and
> >>      completed. If a reset status is repeatedly returned, the context may
> >>      be in the process of resetting."
> >>
> >>If the reset takes a long time, it seems that even a well-behaved app
> >>could run afoul of the 'banned' logic.
> >
> >As there reset status is initialized to I915_RESET_UNKNOWN,
> >we return it if the reset is in progress or gpu is wedged.
> 
> Hmm... so user space will see I915_RESET_UNKNOWN until the reset is
> done, then it will (usually) see either I915_RESET_BATCH_ACTIVE or
> I915_RESET_BATCH_PENDING.  I think that should be okay.
> 
> >>>+	    reset_cnt == I915_WEDGED) {
> >>>+		goto out;
> >>>+	}
> >>>+
> >>>+	/* Set guilty/innocent status if only one reset was
> >>>+	 * observed and if only one guilty was found
> >>>+	 */
> >>>+	if ((rs->reset_cnt + 2) == reset_cnt &&
> >>>+	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
> >>
> >>This logic seems... wrong, or at least weird.  "rs->reset_cnt + 2" is
> >>confusing next to "if only one reset was observed".
> >>
> >>dev_priv->gpu_error.reset_counter is the global GPU reset count since
> >>start-up, and rs->reset_cnt is the global GPU count since start-up when
> >>the context was created.  Right?
> >
> >Right. The confusing part in here is the
> >dev_priv->gpu_error.reset_counter. If it is odd, reset is in progress,
> >if it is even, the reset has been handled and all is well. That is why +2
> 
> That's a clever hack, I'm assuming, to use atomic operations instead
> of locks.   Dear God that's awful to understand... it's a tiny bit
> more clear looking back at the 'reset_cnt &
> I915_RESET_IN_PROGRESS_FLAG'. Perhaps we could get some wrapper
> macros RESET_IN_PROGRESS() and RESET_ACTUAL_COUNT() or something?

Those exist and are called i915_reset_in_progress and
i915_terminally_wedged (the later for the case where the gpu reset failed
and things are terminally hosed). Since the kernel thus far only cared
whether the reset state changed (either from good -> reset_in_progress,
back or to the terminal state) without ever missing a state transition, it
only compares the counter and doesn't care about the actual reset count
one bit. Hence why I didn't add another helper.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts
  2013-03-14 15:52 ` [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
  2013-03-15  9:44   ` Chris Wilson
@ 2013-04-02 22:45   ` Ben Widawsky
  2013-04-02 22:45     ` [PATCH 2/2] drm/i915: Print all contexts in debugfs Ben Widawsky
  2013-04-03  1:27     ` [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts Ben Widawsky
  1 sibling, 2 replies; 35+ messages in thread
From: Ben Widawsky @ 2013-04-02 22:45 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky, Mika Kuoppala

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

In preparation to do analysis of which context was
guilty of gpu hung, store kreffed context pointer
into request struct.

This allows us to inspect contexts when gpu is reset
even if those contexts would already be released
by userspace.

v2: track i915_hw_context pointers instead of using ctx_ids
    (from Chris Wilson)

v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
(recommended by Chris)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            | 10 ++++++++--
 drivers/gpu/drm/i915/i915_gem.c            | 16 +++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.c    | 17 +++++++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  7 ++++---
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5b0c699..b88b67d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -437,6 +437,7 @@ struct i915_hw_ppgtt {
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
 struct i915_hw_context {
+	struct kref ref;
 	int id;
 	bool is_initialized;
 	struct drm_i915_file_private *file_priv;
@@ -1240,6 +1241,9 @@ struct drm_i915_gem_request {
 	/** Postion in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Context related to this request */
+	struct i915_hw_context *ctx;
+
 	/** Time at which this request was emitted, in jiffies. */
 	unsigned long emitted_jiffies;
 
@@ -1630,9 +1634,10 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
 int i915_do_add_request(struct intel_ring_buffer *ring,
 			u32 *seqno,
-			struct drm_file *file);
+			struct drm_file *file,
+			struct i915_hw_context *ctx);
 #define i915_add_request(ring, seqno) \
-	i915_do_add_request(ring, seqno, NULL)
+	i915_do_add_request(ring, seqno, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -1676,6 +1681,7 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 struct i915_hw_context * __must_check
 i915_switch_context(struct intel_ring_buffer *ring,
 		    struct drm_file *file, int to_id);
+void i915_gem_context_free(struct kref *ctx_ref);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fbbe7d9..e55c4a8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1997,7 +1997,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 int
 i915_do_add_request(struct intel_ring_buffer *ring,
 		    u32 *out_seqno,
-		    struct drm_file *file)
+		    struct drm_file *file,
+		    struct i915_hw_context *ctx)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2037,6 +2038,11 @@ i915_do_add_request(struct intel_ring_buffer *ring,
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
 	request->tail = request_ring_position;
+	request->ctx = ctx;
+
+	if (request->ctx)
+		kref_get(&request->ctx->ref);
+
 	request->emitted_jiffies = jiffies;
 	was_empty = list_empty(&ring->request_list);
 	list_add_tail(&request->list, &ring->request_list);
@@ -2101,6 +2107,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
@@ -2195,6 +2205,10 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
 
 		list_del(&request->list);
 		i915_gem_request_remove_from_client(request);
+
+		if (request->ctx)
+			kref_put(&request->ctx->ref, i915_gem_context_free);
+
 		kfree(request);
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ddf26a6..19feeb6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -126,11 +126,18 @@ static int get_context_size(struct drm_device *dev)
 
 static void do_destroy(struct i915_hw_context *ctx)
 {
+	drm_gem_object_unreference(&ctx->obj->base);
+	kfree(ctx);
+}
+
+void i915_gem_context_free(struct kref *ctx_ref)
+{
+	struct i915_hw_context *ctx = container_of(ctx_ref,
+						   typeof(*ctx), ref);
 	if (ctx->file_priv)
 		idr_remove(&ctx->file_priv->context_idr, ctx->id);
 
-	drm_gem_object_unreference(&ctx->obj->base);
-	kfree(ctx);
+	do_destroy(ctx);
 }
 
 static struct i915_hw_context *
@@ -145,6 +152,7 @@ create_hw_context(struct drm_device *dev,
 	if (ctx == NULL)
 		return ERR_PTR(-ENOMEM);
 
+	kref_init(&ctx->ref);
 	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
 	if (ctx->obj == NULL) {
 		kfree(ctx);
@@ -275,7 +283,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 
 	BUG_ON(id == DEFAULT_CONTEXT_ID);
 
-	do_destroy(ctx);
+	ctx->file_priv = NULL;
+	kref_put(&ctx->ref, i915_gem_context_free);
 
 	return 0;
 }
@@ -511,7 +520,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
-	do_destroy(ctx);
+	kref_put(&ctx->ref, i915_gem_context_free);
 
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index d87b94b..7f58b2e 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -794,13 +794,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
 static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
-				    struct intel_ring_buffer *ring)
+				    struct intel_ring_buffer *ring,
+				    struct i915_hw_context *ctx)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_do_add_request(ring, NULL, file);
+	(void)i915_do_add_request(ring, NULL, file, ctx);
 }
 
 static int
@@ -1076,7 +1077,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
 
 err:
 	eb_destroy(eb);
-- 
1.8.2

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/2] drm/i915: Print all contexts in debugfs
  2013-04-02 22:45   ` [PATCH 1/2] [v3] " Ben Widawsky
@ 2013-04-02 22:45     ` Ben Widawsky
  2013-04-03  1:27     ` [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts Ben Widawsky
  1 sibling, 0 replies; 35+ messages in thread
From: Ben Widawsky @ 2013-04-02 22:45 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 7299ea4..70369e4 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1479,11 +1479,23 @@ static int i915_gem_framebuffer_info(struct seq_file *m, void *data)
 	return 0;
 }
 
+static int context_show(int id, void *p, void *data)
+{
+	struct i915_hw_context *ctx = p;
+	struct seq_file *m = data;
+
+	seq_printf(m, "context = %d\n", id);
+	describe_obj(m, ctx->obj);
+	seq_printf(m, "\n");
+	return 0;
+}
+
 static int i915_context_status(struct seq_file *m, void *unused)
 {
 	struct drm_info_node *node = (struct drm_info_node *) m->private;
 	struct drm_device *dev = node->minor->dev;
 	drm_i915_private_t *dev_priv = dev->dev_private;
+	struct drm_file *file;
 	struct intel_ring_buffer *ring;
 	int ret, i;
 
@@ -1511,6 +1523,13 @@ static int i915_context_status(struct seq_file *m, void *unused)
 		}
 	}
 
+	list_for_each_entry(file, &dev->filelist, lhead) {
+		struct drm_i915_file_private *file_priv = file->driver_priv;
+		seq_printf(m, "File = %p ", file);
+		idr_for_each(&file_priv->context_idr, context_show, m);
+		seq_printf(m, "\n");
+	}
+
 	mutex_unlock(&dev->mode_config.mutex);
 
 	return 0;
-- 
1.8.2

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts
  2013-04-02 22:45   ` [PATCH 1/2] [v3] " Ben Widawsky
  2013-04-02 22:45     ` [PATCH 2/2] drm/i915: Print all contexts in debugfs Ben Widawsky
@ 2013-04-03  1:27     ` Ben Widawsky
  2013-04-03 11:56       ` Chris Wilson
  1 sibling, 1 reply; 35+ messages in thread
From: Ben Widawsky @ 2013-04-03  1:27 UTC (permalink / raw)
  To: intel-gfx, Daniel Vetter, Chris Wilson; +Cc: Mika Kuoppala

On Tue, Apr 02, 2013 at 03:45:42PM -0700, Ben Widawsky wrote:
> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> In preparation to do analysis of which context was
> guilty of gpu hung, store kreffed context pointer
> into request struct.
> 
> This allows us to inspect contexts when gpu is reset
> even if those contexts would already be released
> by userspace.
> 
> v2: track i915_hw_context pointers instead of using ctx_ids
>     (from Chris Wilson)
> 
> v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
> (recommended by Chris)
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Now I remember why my version of reference counting was so much more
complicated. In my case, I want to keep the last context around instead
of the last context object. To do this we can't do a kref_put until
we've switched to the next context (similar to how we manage the context
object). I want to do this since the context stores the PPGTT which will
currently be in use. I need to switch PDEs at context switch time.

So the below isn't really useful to me, I think, and I believe I need a
more complex refcounting mechanism as I described on IRC earlier today.

Daniel, Chris, thoughts?


> ---
>  drivers/gpu/drm/i915/i915_drv.h            | 10 ++++++++--
>  drivers/gpu/drm/i915/i915_gem.c            | 16 +++++++++++++++-
>  drivers/gpu/drm/i915/i915_gem_context.c    | 17 +++++++++++++----
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  7 ++++---
>  4 files changed, 40 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 5b0c699..b88b67d 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -437,6 +437,7 @@ struct i915_hw_ppgtt {
>  /* This must match up with the value previously used for execbuf2.rsvd1. */
>  #define DEFAULT_CONTEXT_ID 0
>  struct i915_hw_context {
> +	struct kref ref;
>  	int id;
>  	bool is_initialized;
>  	struct drm_i915_file_private *file_priv;
> @@ -1240,6 +1241,9 @@ struct drm_i915_gem_request {
>  	/** Postion in the ringbuffer of the end of the request */
>  	u32 tail;
>  
> +	/** Context related to this request */
> +	struct i915_hw_context *ctx;
> +
>  	/** Time at which this request was emitted, in jiffies. */
>  	unsigned long emitted_jiffies;
>  
> @@ -1630,9 +1634,10 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
>  int __must_check i915_gem_idle(struct drm_device *dev);
>  int i915_do_add_request(struct intel_ring_buffer *ring,
>  			u32 *seqno,
> -			struct drm_file *file);
> +			struct drm_file *file,
> +			struct i915_hw_context *ctx);
>  #define i915_add_request(ring, seqno) \
> -	i915_do_add_request(ring, seqno, NULL)
> +	i915_do_add_request(ring, seqno, NULL, NULL)
>  int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
>  				 uint32_t seqno);
>  int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
> @@ -1676,6 +1681,7 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
>  struct i915_hw_context * __must_check
>  i915_switch_context(struct intel_ring_buffer *ring,
>  		    struct drm_file *file, int to_id);
> +void i915_gem_context_free(struct kref *ctx_ref);
>  int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
>  				  struct drm_file *file);
>  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index fbbe7d9..e55c4a8 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1997,7 +1997,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
>  int
>  i915_do_add_request(struct intel_ring_buffer *ring,
>  		    u32 *out_seqno,
> -		    struct drm_file *file)
> +		    struct drm_file *file,
> +		    struct i915_hw_context *ctx)
>  {
>  	drm_i915_private_t *dev_priv = ring->dev->dev_private;
>  	struct drm_i915_gem_request *request;
> @@ -2037,6 +2038,11 @@ i915_do_add_request(struct intel_ring_buffer *ring,
>  	request->seqno = intel_ring_get_seqno(ring);
>  	request->ring = ring;
>  	request->tail = request_ring_position;
> +	request->ctx = ctx;
> +
> +	if (request->ctx)
> +		kref_get(&request->ctx->ref);
> +
>  	request->emitted_jiffies = jiffies;
>  	was_empty = list_empty(&ring->request_list);
>  	list_add_tail(&request->list, &ring->request_list);
> @@ -2101,6 +2107,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		list_del(&request->list);
>  		i915_gem_request_remove_from_client(request);
> +
> +		if (request->ctx)
> +			kref_put(&request->ctx->ref, i915_gem_context_free);
> +
>  		kfree(request);
>  	}
>  
> @@ -2195,6 +2205,10 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
>  
>  		list_del(&request->list);
>  		i915_gem_request_remove_from_client(request);
> +
> +		if (request->ctx)
> +			kref_put(&request->ctx->ref, i915_gem_context_free);
> +
>  		kfree(request);
>  	}
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index ddf26a6..19feeb6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -126,11 +126,18 @@ static int get_context_size(struct drm_device *dev)
>  
>  static void do_destroy(struct i915_hw_context *ctx)
>  {
> +	drm_gem_object_unreference(&ctx->obj->base);
> +	kfree(ctx);
> +}
> +
> +void i915_gem_context_free(struct kref *ctx_ref)
> +{
> +	struct i915_hw_context *ctx = container_of(ctx_ref,
> +						   typeof(*ctx), ref);
>  	if (ctx->file_priv)
>  		idr_remove(&ctx->file_priv->context_idr, ctx->id);
>  
> -	drm_gem_object_unreference(&ctx->obj->base);
> -	kfree(ctx);
> +	do_destroy(ctx);
>  }
>  
>  static struct i915_hw_context *
> @@ -145,6 +152,7 @@ create_hw_context(struct drm_device *dev,
>  	if (ctx == NULL)
>  		return ERR_PTR(-ENOMEM);
>  
> +	kref_init(&ctx->ref);
>  	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
>  	if (ctx->obj == NULL) {
>  		kfree(ctx);
> @@ -275,7 +283,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
>  
>  	BUG_ON(id == DEFAULT_CONTEXT_ID);
>  
> -	do_destroy(ctx);
> +	ctx->file_priv = NULL;
> +	kref_put(&ctx->ref, i915_gem_context_free);
>  
>  	return 0;
>  }
> @@ -511,7 +520,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  		return -ENOENT;
>  	}
>  
> -	do_destroy(ctx);
> +	kref_put(&ctx->ref, i915_gem_context_free);
>  
>  	mutex_unlock(&dev->struct_mutex);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index d87b94b..7f58b2e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -794,13 +794,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
>  static void
>  i915_gem_execbuffer_retire_commands(struct drm_device *dev,
>  				    struct drm_file *file,
> -				    struct intel_ring_buffer *ring)
> +				    struct intel_ring_buffer *ring,
> +				    struct i915_hw_context *ctx)
>  {
>  	/* Unconditionally force add_request to emit a full flush. */
>  	ring->gpu_caches_dirty = true;
>  
>  	/* Add a breadcrumb for the completion of the batch buffer */
> -	(void)i915_do_add_request(ring, NULL, file);
> +	(void)i915_do_add_request(ring, NULL, file, ctx);
>  }
>  
>  static int
> @@ -1076,7 +1077,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
>  
>  	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
> -	i915_gem_execbuffer_retire_commands(dev, file, ring);
> +	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
>  
>  err:
>  	eb_destroy(eb);
> -- 
> 1.8.2
> 

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts
  2013-04-03  1:27     ` [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts Ben Widawsky
@ 2013-04-03 11:56       ` Chris Wilson
  2013-04-03 16:37         ` Daniel Vetter
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Wilson @ 2013-04-03 11:56 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx, Mika Kuoppala

On Tue, Apr 02, 2013 at 06:27:00PM -0700, Ben Widawsky wrote:
> On Tue, Apr 02, 2013 at 03:45:42PM -0700, Ben Widawsky wrote:
> > From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > 
> > In preparation to do analysis of which context was
> > guilty of gpu hung, store kreffed context pointer
> > into request struct.
> > 
> > This allows us to inspect contexts when gpu is reset
> > even if those contexts would already be released
> > by userspace.
> > 
> > v2: track i915_hw_context pointers instead of using ctx_ids
> >     (from Chris Wilson)
> > 
> > v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
> > (recommended by Chris)
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> 
> Now I remember why my version of reference counting was so much more
> complicated. In my case, I want to keep the last context around instead
> of the last context object. To do this we can't do a kref_put until
> we've switched to the next context (similar to how we manage the context
> object). I want to do this since the context stores the PPGTT which will
> currently be in use. I need to switch PDEs at context switch time.

This seems feasible using requests and a callback from retire. The
alternative is something hairy like intel_overlay, hence my desire for
keeping all ring operations as a i915_gem_request.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts
  2013-04-03 11:56       ` Chris Wilson
@ 2013-04-03 16:37         ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2013-04-03 16:37 UTC (permalink / raw)
  To: Chris Wilson, Ben Widawsky, intel-gfx, Daniel Vetter,
	Mika Kuoppala, Mika Kuoppala

On Wed, Apr 03, 2013 at 12:56:11PM +0100, Chris Wilson wrote:
> On Tue, Apr 02, 2013 at 06:27:00PM -0700, Ben Widawsky wrote:
> > On Tue, Apr 02, 2013 at 03:45:42PM -0700, Ben Widawsky wrote:
> > > From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > 
> > > In preparation to do analysis of which context was
> > > guilty of gpu hung, store kreffed context pointer
> > > into request struct.
> > > 
> > > This allows us to inspect contexts when gpu is reset
> > > even if those contexts would already be released
> > > by userspace.
> > > 
> > > v2: track i915_hw_context pointers instead of using ctx_ids
> > >     (from Chris Wilson)
> > > 
> > > v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
> > > (recommended by Chris)
> > > 
> > > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > 
> > Now I remember why my version of reference counting was so much more
> > complicated. In my case, I want to keep the last context around instead
> > of the last context object. To do this we can't do a kref_put until
> > we've switched to the next context (similar to how we manage the context
> > object). I want to do this since the context stores the PPGTT which will
> > currently be in use. I need to switch PDEs at context switch time.
> 
> This seems feasible using requests and a callback from retire. The
> alternative is something hairy like intel_overlay, hence my desire for
> keeping all ring operations as a i915_gem_request.

As long as the request object keeps a ref while the request is still
oustanding and the ring itself keeps a ref to whatever is the currently
last scheduled context, everything should work out fine. So I don't think
we need to jump through any complicated hoops here.

One quick bikeshed on the patch itself though: I'd like to see some static
inlines for kref_get/put on contexts ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2013-04-03 16:34 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-14 15:52 [PATCH v2 00/16] arb robustness enablers v2 Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
2013-03-15  9:43   ` Chris Wilson
2013-03-14 15:52 ` [PATCH v2 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
2013-03-15  9:44   ` Chris Wilson
2013-04-02 22:45   ` [PATCH 1/2] [v3] " Ben Widawsky
2013-04-02 22:45     ` [PATCH 2/2] drm/i915: Print all contexts in debugfs Ben Widawsky
2013-04-03  1:27     ` [PATCH 1/2] [v3] drm/i915: reference count for i915_hw_contexts Ben Widawsky
2013-04-03 11:56       ` Chris Wilson
2013-04-03 16:37         ` Daniel Vetter
2013-03-14 15:52 ` [PATCH v2 04/16] drm/i915: Resurrect ring kicking for semaphores, selectively Mika Kuoppala
2013-03-17 21:53   ` Daniel Vetter
2013-03-14 15:52 ` [PATCH v2 05/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 06/16] drm/i915: track ring progression using seqnos Mika Kuoppala
2013-03-15  9:48   ` Chris Wilson
2013-03-14 15:52 ` [PATCH v2 07/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 08/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 09/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 10/16] drm/i915: add struct ctx_reset_state Mika Kuoppala
2013-03-15  9:52   ` Chris Wilson
2013-03-17 21:51     ` Daniel Vetter
2013-03-18 20:18   ` Ian Romanick
2013-03-14 15:52 ` [PATCH v2 11/16] drm/i915: add reset_state for hw_contexts Mika Kuoppala
2013-03-15  9:53   ` Chris Wilson
2013-03-14 15:52 ` [PATCH v2 12/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 13/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 14/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 15/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
2013-03-14 15:52 ` [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl Mika Kuoppala
2013-03-15 10:01   ` Chris Wilson
2013-03-18 20:26   ` Ian Romanick
2013-03-19 12:58     ` Mika Kuoppala
2013-03-19 19:02       ` Ian Romanick
2013-03-19 19:21         ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.