* [PATCH v3 00/16] arb robustness enablers v3
@ 2013-04-04 15:32 Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
                   ` (16 more replies)
  0 siblings, 17 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

Hi,

I have addressed all the feedback I got on the v2 series.

Most notably, the interface has changed.

Users who don't care about the exact counts can now treat batch_pending
and batch_active as flags (see the i-g-t testcase), assuming they
destroy the context after the hang is dealt with.
Users who want more detailed heuristics before deciding on an action
can use batch_pending and batch_active as counts.

Statistics not tied to the caller's own process contexts are only
available if CAP_SYS_ADMIN is held.

The whole series can be found at:
https://github.com/mkuoppal/linux/commits/arb-robustness

The test case for i-g-t is at:
https://github.com/mkuoppal/intel-gpu-tools/commits/arb-robustness

Thanks for your feedback,
-Mika

Mika Kuoppala (16):
  drm/i915: return context from i915_switch_context()
  drm/i915: cleanup i915_add_request
  drm/i915: reference count for i915_hw_contexts
  drm/i915: pass seqno to i915_hangcheck_ring_idle
  drm/i915: track ring progression using seqnos
  drm/i915: introduce i915_hangcheck_ring_hung
  drm/i915: detect hang using per ring hangcheck_score
  drm/i915: remove i915_hangcheck_hung
  drm/i915: add struct i915_ctx_hang_stats
  drm/i915: add i915_gem_context_get_hang_stats()
  drm/i915: add batch object and context to i915_add_request()
  drm/i915: mark rings which were waiting when hang happened
  drm/i915: find guilty batch buffer on ring resets
  drm/i915: refuse to submit more batchbuffers from guilty context
  drm/i915: add i915_reset_count
  drm/i915: add i915_get_reset_stats_ioctl

 drivers/gpu/drm/i915/i915_dma.c            |    3 +-
 drivers/gpu/drm/i915/i915_drv.c            |   60 +++++++++++-
 drivers/gpu/drm/i915/i915_drv.h            |   73 +++++++++++---
 drivers/gpu/drm/i915/i915_gem.c            |  141 ++++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_gem_context.c    |   79 ++++++++++++----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   28 +++++-
 drivers/gpu/drm/i915/i915_irq.c            |  113 ++++++++++------------
 drivers/gpu/drm/i915/intel_overlay.c       |    4 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h    |    8 ++
 include/uapi/drm/i915_drm.h                |   17 ++++
 11 files changed, 410 insertions(+), 118 deletions(-)

-- 
1.7.9.5


* [PATCH v3 01/16] drm/i915: return context from i915_switch_context()
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-20 18:10   ` Ben Widawsky
  2013-04-04 15:32 ` [PATCH v3 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

In preparation for the next commit, return the context that
was switched to from i915_switch_context().

v2: context in return value instead of param. (Ben Widawsky)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    5 +++--
 drivers/gpu/drm/i915/i915_gem.c            |    8 ++++---
 drivers/gpu/drm/i915/i915_gem_context.c    |   32 +++++++++++++++++-----------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++--
 4 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 44fca0b..630982b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1687,8 +1687,9 @@ struct dma_buf *i915_gem_prime_export(struct drm_device *dev,
 void i915_gem_context_init(struct drm_device *dev);
 void i915_gem_context_fini(struct drm_device *dev);
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
-int i915_switch_context(struct intel_ring_buffer *ring,
-			struct drm_file *file, int to_id);
+struct i915_hw_context * __must_check
+i915_switch_context(struct intel_ring_buffer *ring,
+		    struct drm_file *file, int to_id);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 911bd40..db804cc 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2522,9 +2522,11 @@ int i915_gpu_idle(struct drm_device *dev)
 
 	/* Flush everything onto the inactive list. */
 	for_each_ring(ring, dev_priv, i) {
-		ret = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
-		if (ret)
-			return ret;
+		struct i915_hw_context *ctx;
+
+		ctx = i915_switch_context(ring, NULL, DEFAULT_CONTEXT_ID);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
 
 		ret = intel_ring_idle(ring);
 		if (ret)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 94d873a..ddf26a6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -417,41 +417,47 @@ static int do_switch(struct i915_hw_context *to)
 /**
  * i915_switch_context() - perform a GPU context switch.
  * @ring: ring for which we'll execute the context switch
- * @file_priv: file_priv associated with the context, may be NULL
- * @id: context id number
- * @seqno: sequence number by which the new context will be switched to
- * @flags:
+ * @file: file associated with the context, may be NULL
+ * @to_id: context id number
+ * @return: context that was switched to or ERR_PTR on error
  *
 * The context life cycle is simple. The context refcount is incremented and
 * decremented by 1 on create and destroy. If the context is in use by the GPU,
 * it will have a refcount > 1. This allows us to destroy the context abstract
  * object while letting the normal object tracking destroy the backing BO.
  */
-int i915_switch_context(struct intel_ring_buffer *ring,
-			struct drm_file *file,
-			int to_id)
+
+struct i915_hw_context *
+i915_switch_context(struct intel_ring_buffer *ring,
+		    struct drm_file *file,
+		    int to_id)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct i915_hw_context *to;
+	struct i915_hw_context *to = NULL;
+	int ret = 0;
 
 	if (dev_priv->hw_contexts_disabled)
-		return 0;
+		return NULL;
 
 	if (ring != &dev_priv->ring[RCS])
-		return 0;
+		return NULL;
 
 	if (to_id == DEFAULT_CONTEXT_ID) {
 		to = ring->default_context;
 	} else {
 		if (file == NULL)
-			return -EINVAL;
+			return ERR_PTR(-EINVAL);
 
 		to = i915_gem_context_get(file->driver_priv, to_id);
 		if (to == NULL)
-			return -ENOENT;
+			return ERR_PTR(-ENOENT);
 	}
 
-	return do_switch(to);
+	ret = do_switch(to);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return to;
 }
 
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index a96b6a3..e0686ca 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -841,6 +841,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_i915_gem_object *batch_obj;
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
+	struct i915_hw_context *ctx;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1023,9 +1024,11 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
-	ret = i915_switch_context(ring, file, ctx_id);
-	if (ret)
+	ctx = i915_switch_context(ring, file, ctx_id);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
 		goto err;
+	}
 
 	if (ring == &dev_priv->ring[RCS] &&
 	    mode != dev_priv->relative_constants_mode) {
-- 
1.7.9.5


* [PATCH v3 02/16] drm/i915: cleanup i915_add_request
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-20 18:36   ` Ben Widawsky
  2013-04-04 15:32 ` [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

Only execbuffer needs all the parameters, so clean everything
else up behind a macro.

v2: _i915_add_request as function name (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |    8 +++++---
 drivers/gpu/drm/i915/i915_gem.c            |   11 +++++------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    2 +-
 drivers/gpu/drm/i915/intel_overlay.c       |    4 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 630982b..e5e2083 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1644,9 +1644,11 @@ void i915_gem_init_ppgtt(struct drm_device *dev);
 void i915_gem_cleanup_ringbuffer(struct drm_device *dev);
 int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
-int i915_add_request(struct intel_ring_buffer *ring,
-		     struct drm_file *file,
-		     u32 *seqno);
+int _i915_add_request(struct intel_ring_buffer *ring,
+		      u32 *seqno,
+		      struct drm_file *file);
+#define i915_add_request(ring, seqno) \
+	_i915_add_request(ring, seqno, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index db804cc..d3ce0a7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -959,7 +959,7 @@ i915_gem_check_olr(struct intel_ring_buffer *ring, u32 seqno)
 
 	ret = 0;
 	if (seqno == ring->outstanding_lazy_request)
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL);
 
 	return ret;
 }
@@ -1999,10 +1999,9 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 	return 0;
 }
 
-int
-i915_add_request(struct intel_ring_buffer *ring,
-		 struct drm_file *file,
-		 u32 *out_seqno)
+int _i915_add_request(struct intel_ring_buffer *ring,
+		      u32 *out_seqno,
+		      struct drm_file *file)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2267,7 +2266,7 @@ i915_gem_retire_work_handler(struct work_struct *work)
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
 		if (ring->gpu_caches_dirty)
-			i915_add_request(ring, NULL, NULL);
+			i915_add_request(ring, NULL);
 
 		idle &= list_empty(&ring->request_list);
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index e0686ca..b413962 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -802,7 +802,7 @@ i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)i915_add_request(ring, file, NULL);
+	(void)_i915_add_request(ring, NULL, file);
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/intel_overlay.c b/drivers/gpu/drm/i915/intel_overlay.c
index 67a2501..40e509c 100644
--- a/drivers/gpu/drm/i915/intel_overlay.c
+++ b/drivers/gpu/drm/i915/intel_overlay.c
@@ -217,7 +217,7 @@ static int intel_overlay_do_wait_request(struct intel_overlay *overlay,
 	int ret;
 
 	BUG_ON(overlay->last_flip_req);
-	ret = i915_add_request(ring, NULL, &overlay->last_flip_req);
+	ret = i915_add_request(ring, &overlay->last_flip_req);
 	if (ret)
 		return ret;
 
@@ -286,7 +286,7 @@ static int intel_overlay_continue(struct intel_overlay *overlay,
 	intel_ring_emit(ring, flip_addr);
 	intel_ring_advance(ring);
 
-	return i915_add_request(ring, NULL, &overlay->last_flip_req);
+	return i915_add_request(ring, &overlay->last_flip_req);
 }
 
 static void intel_overlay_release_old_vid_tail(struct intel_overlay *overlay)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 1d5d613..9584bc1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1423,7 +1423,7 @@ int intel_ring_idle(struct intel_ring_buffer *ring)
 
 	/* We need to add any requests required to flush the objects and ring */
 	if (ring->outstanding_lazy_request) {
-		ret = i915_add_request(ring, NULL, NULL);
+		ret = i915_add_request(ring, NULL);
 		if (ret)
 			return ret;
 	}
-- 
1.7.9.5


* [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-09 22:18   ` Chris Wilson
  2013-04-04 15:32 ` [PATCH v3 04/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

In preparation for analyzing which context was
guilty of a gpu hang, store a kref'd context pointer
in the request struct.

This allows us to inspect contexts when the gpu is reset,
even if userspace has already released those contexts.

v2: track i915_hw_context pointers instead of using ctx_ids
    (from Chris Wilson)

v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
    (recommended by Chris)

v4: kref_* put inside static inlines (Daniel Vetter)
    remove code duplication on freeing context (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v3)
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v4)
---
 drivers/gpu/drm/i915/i915_drv.h            |   20 ++++++++++++++++++--
 drivers/gpu/drm/i915/i915_gem.c            |   27 ++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_context.c    |   19 ++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++---
 4 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e5e2083..3c85c1b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -456,6 +456,7 @@ struct i915_hw_ppgtt {
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
 struct i915_hw_context {
+	struct kref ref;
 	int id;
 	bool is_initialized;
 	struct drm_i915_file_private *file_priv;
@@ -1260,6 +1261,9 @@ struct drm_i915_gem_request {
 	/** Position in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Context related to this request */
+	struct i915_hw_context *ctx;
+
 	/** Time at which this request was emitted, in jiffies. */
 	unsigned long emitted_jiffies;
 
@@ -1646,9 +1650,10 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *seqno,
-		      struct drm_file *file);
+		      struct drm_file *file,
+		      struct i915_hw_context *ctx);
 #define i915_add_request(ring, seqno) \
-	_i915_add_request(ring, seqno, NULL)
+	_i915_add_request(ring, seqno, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -1692,6 +1697,17 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 struct i915_hw_context * __must_check
 i915_switch_context(struct intel_ring_buffer *ring,
 		    struct drm_file *file, int to_id);
+void i915_gem_context_free(struct kref *ctx_ref);
+static inline void i915_gem_context_reference(struct i915_hw_context *ctx)
+{
+	kref_get(&ctx->ref);
+}
+
+static inline void i915_gem_context_unreference(struct i915_hw_context *ctx)
+{
+	kref_put(&ctx->ref, i915_gem_context_free);
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index d3ce0a7..f586f9c4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2001,7 +2001,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *out_seqno,
-		      struct drm_file *file)
+		      struct drm_file *file,
+		      struct i915_hw_context *ctx)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2041,6 +2042,11 @@ int _i915_add_request(struct intel_ring_buffer *ring,
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
 	request->tail = request_ring_position;
+	request->ctx = ctx;
+
+	if (request->ctx)
+		i915_gem_context_reference(ctx);
+
 	request->emitted_jiffies = jiffies;
 	was_empty = list_empty(&ring->request_list);
 	list_add_tail(&request->list, &ring->request_list);
@@ -2093,6 +2099,17 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
 	spin_unlock(&file_priv->mm.lock);
 }
 
+static void i915_gem_free_request(struct drm_i915_gem_request *request)
+{
+	list_del(&request->list);
+	i915_gem_request_remove_from_client(request);
+
+	if (request->ctx)
+		i915_gem_context_unreference(request->ctx);
+
+	kfree(request);
+}
+
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
@@ -2103,9 +2120,7 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   struct drm_i915_gem_request,
 					   list);
 
-		list_del(&request->list);
-		i915_gem_request_remove_from_client(request);
-		kfree(request);
+		i915_gem_free_request(request);
 	}
 
 	while (!list_empty(&ring->active_list)) {
@@ -2197,9 +2212,7 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
 		 */
 		ring->last_retired_head = request->tail;
 
-		list_del(&request->list);
-		i915_gem_request_remove_from_client(request);
-		kfree(request);
+		i915_gem_free_request(request);
 	}
 
 	/* Move any buffers on the active list that are no longer referenced
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ddf26a6..de01a18 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -126,11 +126,19 @@ static int get_context_size(struct drm_device *dev)
 
 static void do_destroy(struct i915_hw_context *ctx)
 {
+	drm_gem_object_unreference(&ctx->obj->base);
+	kfree(ctx);
+}
+
+void i915_gem_context_free(struct kref *ctx_ref)
+{
+	struct i915_hw_context *ctx = container_of(ctx_ref,
+						   typeof(*ctx), ref);
+
 	if (ctx->file_priv)
 		idr_remove(&ctx->file_priv->context_idr, ctx->id);
 
-	drm_gem_object_unreference(&ctx->obj->base);
-	kfree(ctx);
+	do_destroy(ctx);
 }
 
 static struct i915_hw_context *
@@ -145,6 +153,7 @@ create_hw_context(struct drm_device *dev,
 	if (ctx == NULL)
 		return ERR_PTR(-ENOMEM);
 
+	kref_init(&ctx->ref);
 	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
 	if (ctx->obj == NULL) {
 		kfree(ctx);
@@ -275,8 +284,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 
 	BUG_ON(id == DEFAULT_CONTEXT_ID);
 
-	do_destroy(ctx);
-
+	ctx->file_priv = NULL;
+	i915_gem_context_unreference(ctx);
 	return 0;
 }
 
@@ -511,7 +520,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
-	do_destroy(ctx);
+	i915_gem_context_unreference(ctx);
 
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index b413962..757b52d 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -796,13 +796,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
 static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
-				    struct intel_ring_buffer *ring)
+				    struct intel_ring_buffer *ring,
+				    struct i915_hw_context *ctx)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)_i915_add_request(ring, NULL, file);
+	(void)_i915_add_request(ring, NULL, file, ctx);
 }
 
 static int
@@ -1077,7 +1078,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
 
 err:
 	eb_destroy(eb);
-- 
1.7.9.5


* [PATCH v3 04/16] drm/i915: pass seqno to i915_hangcheck_ring_idle
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (2 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 05/16] drm/i915: track ring progression using seqnos Mika Kuoppala
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

In preparation for the next commit, pass the seqno as a parameter
to i915_hangcheck_ring_idle, as it will be used inside
i915_hangcheck_elapsed.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4c5bdd0..45c83fb 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1846,11 +1846,11 @@ ring_last_seqno(struct intel_ring_buffer *ring)
 			  struct drm_i915_gem_request, list)->seqno;
 }
 
-static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err)
+static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring,
+				     u32 ring_seqno, bool *err)
 {
 	if (list_empty(&ring->request_list) ||
-	    i915_seqno_passed(ring->get_seqno(ring, false),
-			      ring_last_seqno(ring))) {
+	    i915_seqno_passed(ring_seqno, ring_last_seqno(ring))) {
 		/* Issue a wake-up to catch stuck h/w. */
 		if (waitqueue_active(&ring->irq_queue)) {
 			DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
@@ -1967,7 +1967,10 @@ void i915_hangcheck_elapsed(unsigned long data)
 	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-	    idle &= i915_hangcheck_ring_idle(ring, &err);
+		u32 seqno;
+
+		seqno = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
 	    acthd[i] = intel_ring_get_active_head(ring);
 	}
 
-- 
1.7.9.5


* [PATCH v3 05/16] drm/i915: track ring progression using seqnos
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (3 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 04/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-20 18:43   ` Ben Widawsky
  2013-04-04 15:32 ` [PATCH v3 06/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

Instead of relying on acthd, track ring seqno progression
to detect whether a ring has hung.

v2: put hangcheck stuff inside struct (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    2 --
 drivers/gpu/drm/i915/i915_irq.c         |   30 +++++++++++++-----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    6 ++++++
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3c85c1b..0404116 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -816,8 +816,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 	int hangcheck_count;
-	uint32_t last_acthd[I915_NUM_RINGS];
-	uint32_t prev_instdone[I915_NUM_INSTDONE_REG];
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 45c83fb..0363871 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1956,22 +1956,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	uint32_t acthd[I915_NUM_RINGS], instdone[I915_NUM_INSTDONE_REG];
 	struct intel_ring_buffer *ring;
 	bool err = false, idle;
 	int i;
+	u32 seqno[I915_NUM_RINGS];
+	bool work_done;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	memset(acthd, 0, sizeof(acthd));
 	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		u32 seqno;
-
-		seqno = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno, &err);
-	    acthd[i] = intel_ring_get_active_head(ring);
+		seqno[i] = ring->get_seqno(ring, false);
+		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
 	}
 
 	/* If all work is done then ACTHD clearly hasn't advanced. */
@@ -1987,20 +1984,19 @@ void i915_hangcheck_elapsed(unsigned long data)
 		return;
 	}
 
-	i915_get_extra_instdone(dev, instdone);
-	if (memcmp(dev_priv->gpu_error.last_acthd, acthd,
-		   sizeof(acthd)) == 0 &&
-	    memcmp(dev_priv->gpu_error.prev_instdone, instdone,
-		   sizeof(instdone)) == 0) {
+	work_done = false;
+	for_each_ring(ring, dev_priv, i) {
+		if (ring->hangcheck.seqno != seqno[i]) {
+			work_done = true;
+			ring->hangcheck.seqno = seqno[i];
+		}
+	}
+
+	if (!work_done) {
 		if (i915_hangcheck_hung(dev))
 			return;
 	} else {
 		dev_priv->gpu_error.hangcheck_count = 0;
-
-		memcpy(dev_priv->gpu_error.last_acthd, acthd,
-		       sizeof(acthd));
-		memcpy(dev_priv->gpu_error.prev_instdone, instdone,
-		       sizeof(instdone));
 	}
 
 repeat:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d66208c..844381e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -37,6 +37,10 @@ struct  intel_hw_status_page {
 #define I915_READ_SYNC_0(ring) I915_READ(RING_SYNC_0((ring)->mmio_base))
 #define I915_READ_SYNC_1(ring) I915_READ(RING_SYNC_1((ring)->mmio_base))
 
+struct intel_ring_hangcheck {
+	u32 seqno;
+};
+
 struct  intel_ring_buffer {
 	const char	*name;
 	enum intel_ring_id {
@@ -137,6 +141,8 @@ struct  intel_ring_buffer {
 	struct i915_hw_context *default_context;
 	struct drm_i915_gem_object *last_context_obj;
 
+	struct intel_ring_hangcheck hangcheck;
+
 	void *private;
 };
 
-- 
1.7.9.5


* [PATCH v3 06/16] drm/i915: introduce i915_hangcheck_ring_hung
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (4 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 05/16] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 07/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

In preparation for tracking per-ring progress in hangcheck,
add i915_hangcheck_ring_hung.

v2: omit dev parameter (Ben Widawsky)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 0363871..7342a96 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1917,28 +1917,33 @@ static bool kick_ring(struct intel_ring_buffer *ring)
 	return false;
 }
 
+static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
+{
+	if (IS_GEN2(ring->dev))
+		return false;
+
+	/* Is the chip hanging on a WAIT_FOR_EVENT?
+	 * If so we can simply poke the RB_WAIT bit
+	 * and break the hang. This should work on
+	 * all but the second generation chipsets.
+	 */
+	return !kick_ring(ring);
+}
+
 static bool i915_hangcheck_hung(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 
 	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
 		bool hung = true;
+		struct intel_ring_buffer *ring;
+		int i;
 
 		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
 		i915_handle_error(dev, true);
 
-		if (!IS_GEN2(dev)) {
-			struct intel_ring_buffer *ring;
-			int i;
-
-			/* Is the chip hanging on a WAIT_FOR_EVENT?
-			 * If so we can simply poke the RB_WAIT bit
-			 * and break the hang. This should work on
-			 * all but the second generation chipsets.
-			 */
-			for_each_ring(ring, dev_priv, i)
-				hung &= !kick_ring(ring);
-		}
+		for_each_ring(ring, dev_priv, i)
+			hung &= i915_hangcheck_ring_hung(ring);
 
 		return hung;
 	}
-- 
1.7.9.5


* [PATCH v3 07/16] drm/i915: detect hang using per ring hangcheck_score
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (5 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 06/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 08/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

Add a per-ring score for the likely culprit of a gpu hang. If a
ring is busy and not waiting, it will accumulate the highest score
across calls to i915_hangcheck_elapsed. This way we are most likely
to single out the ring that caused the hang from the rings that are
merely waiting on it.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |   63 ++++++++++++++++---------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 7342a96..a41ab2d 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -413,7 +413,6 @@ static void notify_ring(struct drm_device *dev,
 
 	wake_up_all(&ring->irq_queue);
 	if (i915_enable_hangcheck) {
-		dev_priv->gpu_error.hangcheck_count = 0;
 		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
 			  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
 	}
@@ -1962,52 +1961,56 @@ void i915_hangcheck_elapsed(unsigned long data)
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct intel_ring_buffer *ring;
-	bool err = false, idle;
 	int i;
-	u32 seqno[I915_NUM_RINGS];
-	bool work_done;
+	int busy_count = 0, rings_hung = 0;
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		seqno[i] = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
-	}
+		u32 seqno;
+		bool idle, err = false;
+
+		seqno = ring->get_seqno(ring, false);
+		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
 
-	/* If all work is done then ACTHD clearly hasn't advanced. */
-	if (idle) {
-		if (err) {
-			if (i915_hangcheck_hung(dev))
-				return;
+		if (idle) {
+			if (err)
+				ring->hangcheck.score++;
+			else
+				ring->hangcheck.score = 0;
+		} else {
+			busy_count++;
 
-			goto repeat;
+			if (ring->hangcheck.seqno == seqno) {
+				ring->hangcheck.score++;
+
+				/* Kick ring */
+				i915_hangcheck_ring_hung(ring);
+			} else {
+				ring->hangcheck.score = 0;
+			}
 		}
 
-		dev_priv->gpu_error.hangcheck_count = 0;
-		return;
+		ring->hangcheck.seqno = seqno;
 	}
 
-	work_done = false;
 	for_each_ring(ring, dev_priv, i) {
-		if (ring->hangcheck.seqno != seqno[i]) {
-			work_done = true;
-			ring->hangcheck.seqno = seqno[i];
+		if (ring->hangcheck.score > 1) {
+			rings_hung++;
+			DRM_ERROR("%s seems hung\n", ring->name);
 		}
 	}
 
-	if (!work_done) {
-		if (i915_hangcheck_hung(dev))
-			return;
-	} else {
-		dev_priv->gpu_error.hangcheck_count = 0;
-	}
+	if (rings_hung)
+		return i915_handle_error(dev, true);
 
-repeat:
-	/* Reset timer case chip hangs without another request being added */
-	mod_timer(&dev_priv->gpu_error.hangcheck_timer,
-		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
+	if (busy_count)
+		/* Reset timer case chip hangs without another request
+		 * being added */
+		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
+			  round_jiffies_up(jiffies +
+					   DRM_I915_HANGCHECK_JIFFIES));
 }
 
 /* drm_dma.h hooks
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 844381e..503e913 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -39,6 +39,7 @@ struct  intel_hw_status_page {
 
 struct intel_ring_hangcheck {
 	u32 seqno;
+	int score;
 };
 
 struct  intel_ring_buffer {
-- 
1.7.9.5
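The per-ring scoring above can be modeled in userspace. This is a standalone sketch, not driver code: it covers only the busy-ring path (in the patch, an idle ring with a pending error also bumps the score), and the struct/function names are hypothetical stand-ins for the kernel ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-ring scoring in
 * i915_hangcheck_elapsed(): a ring that stays busy without its
 * seqno advancing accumulates score; progress resets it. */
struct ring_state {
	unsigned int last_seqno; /* seqno seen on the previous tick */
	int score;               /* consecutive ticks without progress */
};

/* One hangcheck tick for one ring; returns true once the ring
 * looks hung (score > 1, i.e. no progress for two ticks). */
static bool hangcheck_tick(struct ring_state *r, unsigned int seqno, bool busy)
{
	if (!busy || seqno != r->last_seqno)
		r->score = 0;   /* idle, or making progress */
	else
		r->score++;     /* busy but seqno is stuck */
	r->last_seqno = seqno;
	return r->score > 1;
}
```

A ring therefore has to sit on the same seqno for two consecutive hangcheck periods before it is declared hung, which filters out a single slow batch.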


* [PATCH v3 08/16] drm/i915: remove i915_hangcheck_hung
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (6 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 07/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 09/16] drm/i915: add struct i915_ctx_hang_stats Mika Kuoppala
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

The rework of the per-ring hangcheck made this function obsolete.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |    1 -
 drivers/gpu/drm/i915/i915_irq.c |   21 ---------------------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0404116..5ccfc6c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -815,7 +815,6 @@ struct i915_gpu_error {
 #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */
 #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
-	int hangcheck_count;
 
 	/* For reset and error_state handling. */
 	spinlock_t lock;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a41ab2d..df40bb2 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1929,27 +1929,6 @@ static bool i915_hangcheck_ring_hung(struct intel_ring_buffer *ring)
 	return !kick_ring(ring);
 }
 
-static bool i915_hangcheck_hung(struct drm_device *dev)
-{
-	drm_i915_private_t *dev_priv = dev->dev_private;
-
-	if (dev_priv->gpu_error.hangcheck_count++ > 1) {
-		bool hung = true;
-		struct intel_ring_buffer *ring;
-		int i;
-
-		DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
-		i915_handle_error(dev, true);
-
-		for_each_ring(ring, dev_priv, i)
-			hung &= i915_hangcheck_ring_hung(ring);
-
-		return hung;
-	}
-
-	return false;
-}
-
 /**
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. The first time this is called we simply record
-- 
1.7.9.5


* [PATCH v3 09/16] drm/i915: add struct i915_ctx_hang_stats
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (7 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 08/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 10/16] drm/i915: add i915_gem_context_get_hang_stats() Mika Kuoppala
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

To count context losses, add struct i915_ctx_hang_stats for
both i915_hw_context and drm_i915_file_private.
drm_i915_file_private is used when there is no context.

v2: renamed and cleaned up the struct (Chris Wilson, Ian Romanick)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c |    2 +-
 drivers/gpu/drm/i915/i915_drv.h |   10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 4be58e3..6693c7c 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1789,7 +1789,7 @@ int i915_driver_open(struct drm_device *dev, struct drm_file *file)
 	struct drm_i915_file_private *file_priv;
 
 	DRM_DEBUG_DRIVER("\n");
-	file_priv = kmalloc(sizeof(*file_priv), GFP_KERNEL);
+	file_priv = kzalloc(sizeof(*file_priv), GFP_KERNEL);
 	if (!file_priv)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5ccfc6c..9bbcf0c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -452,6 +452,13 @@ struct i915_hw_ppgtt {
 	void (*cleanup)(struct i915_hw_ppgtt *ppgtt);
 };
 
+struct i915_ctx_hang_stats {
+	/* This context had batch pending when hang was declared */
+	unsigned batch_pending;
+
+	/* This context had batch active when hang was declared */
+	unsigned batch_active;
+};
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
@@ -462,6 +469,7 @@ struct i915_hw_context {
 	struct drm_i915_file_private *file_priv;
 	struct intel_ring_buffer *ring;
 	struct drm_i915_gem_object *obj;
+	struct i915_ctx_hang_stats hang_stats;
 };
 
 enum no_fbc_reason {
@@ -1278,6 +1286,8 @@ struct drm_i915_file_private {
 		struct list_head request_list;
 	} mm;
 	struct idr context_idr;
+
+	struct i915_ctx_hang_stats hang_stats;
 };
 
 #define INTEL_INFO(dev)	(((struct drm_i915_private *) (dev)->dev_private)->info)
-- 
1.7.9.5


* [PATCH v3 10/16] drm/i915: add i915_gem_context_get_hang_stats()
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (8 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 09/16] drm/i915: add struct i915_ctx_hang_stats Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 11/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

To get context hang statistics for the specified context,
add i915_gem_context_get_hang_stats().

For arb-robustness, every context needs to have its own
hang statistics tracking. The added function returns the
statistics of the user-specified context, or, in the case
of the default context, the statistics kept in
drm_i915_file_private.

v2: handle default context inside get_reset_state

v3: return struct pointer instead of passing it in as param
    (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |    4 ++++
 drivers/gpu/drm/i915/i915_gem_context.c |   28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 9bbcf0c..3863bd4 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1715,6 +1715,10 @@ static inline void i915_gem_context_unreference(struct i915_hw_context *ctx)
 	kref_put(&ctx->ref, i915_gem_context_free);
 }
 
+struct i915_ctx_hang_stats * __must_check
+i915_gem_context_get_hang_stats(struct intel_ring_buffer *ring,
+				struct drm_file *file,
+				u32 id);
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index de01a18..1c7a51d 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -289,6 +289,34 @@ static int context_idr_cleanup(int id, void *p, void *data)
 	return 0;
 }
 
+struct i915_ctx_hang_stats *
+i915_gem_context_get_hang_stats(struct intel_ring_buffer *ring,
+				struct drm_file *file,
+				u32 id)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct drm_i915_file_private *file_priv = file->driver_priv;
+	struct i915_hw_context *to;
+
+	if (dev_priv->hw_contexts_disabled)
+		return ERR_PTR(-ENOENT);
+
+	if (ring->id != RCS)
+		return ERR_PTR(-EINVAL);
+
+	if (file == NULL)
+		return ERR_PTR(-EINVAL);
+
+	if (id == DEFAULT_CONTEXT_ID)
+		return &file_priv->hang_stats;
+
+	to = i915_gem_context_get(file->driver_priv, id);
+	if (to == NULL)
+		return ERR_PTR(-ENOENT);
+
+	return &to->hang_stats;
+}
+
 void i915_gem_context_close(struct drm_device *dev, struct drm_file *file)
 {
 	struct drm_i915_file_private *file_priv = file->driver_priv;
-- 
1.7.9.5
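The function above returns either a valid pointer or an errno encoded with ERR_PTR(). For reference, the kernel's ERR_PTR/IS_ERR idiom can be reproduced in userspace; this is a minimal sketch of the helpers from the kernel's err.h, not the driver code itself:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Minimal userspace re-implementation of the ERR_PTR/IS_ERR idiom
 * used by i915_gem_context_get_hang_stats(): a negative errno is
 * packed into the top of the pointer range instead of returning
 * NULL, so one return value carries both outcomes. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

Callers check IS_ERR() first and recover the errno with PTR_ERR(), exactly as the execbuffer and reset-stats paths later in this series do.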


* [PATCH v3 11/16] drm/i915: add batch object and context to i915_add_request()
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (9 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 10/16] drm/i915: add i915_gem_context_get_hang_stats() Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 12/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

In order to track down the batch buffer and context which
caused the ring to hang, store references to the bo and the
context in the request struct. A request can also cause the
gpu to hang after the batch, in the flush section emitted
into the ring. To detect this, also store in the request the
ring offset where the flush portion starts.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |   13 ++++++++++---
 drivers/gpu/drm/i915/i915_gem.c            |    8 ++++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    7 ++++---
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3863bd4..8223908 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1263,9 +1263,15 @@ struct drm_i915_gem_request {
 	/** GEM sequence number associated with this request. */
 	uint32_t seqno;
 
-	/** Postion in the ringbuffer of the end of the request */
+	/** Position in the ringbuffer of the start of the request */
+	u32 head;
+
+	/** Position in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Batch buffer related to this request if any */
+	struct drm_i915_gem_object *batch_obj;
+
 	/** Context related to this request */
 	struct i915_hw_context *ctx;
 
@@ -1658,9 +1664,10 @@ int __must_check i915_gem_idle(struct drm_device *dev);
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *seqno,
 		      struct drm_file *file,
-		      struct i915_hw_context *ctx);
+		      struct i915_hw_context *ctx,
+		      struct drm_i915_gem_object *batch_obj);
 #define i915_add_request(ring, seqno) \
-	_i915_add_request(ring, seqno, NULL, NULL)
+	_i915_add_request(ring, seqno, NULL, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f586f9c4..547eaf5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2002,14 +2002,16 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *out_seqno,
 		      struct drm_file *file,
-		      struct i915_hw_context *ctx)
+		      struct i915_hw_context *ctx,
+		      struct drm_i915_gem_object *obj)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
-	u32 request_ring_position;
+	u32 request_ring_position, request_start;
 	int was_empty;
 	int ret;
 
+	request_start = intel_ring_get_tail(ring);
 	/*
 	 * Emit any outstanding flushes - execbuf can fail to emit the flush
 	 * after having emitted the batchbuffer command. Hence we need to fix
@@ -2041,7 +2043,9 @@ int _i915_add_request(struct intel_ring_buffer *ring,
 
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
+	request->head = request_start;
 	request->tail = request_ring_position;
+	request->batch_obj = obj;
 	request->ctx = ctx;
 
 	if (request->ctx)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 757b52d..bd1750a 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -797,13 +797,14 @@ static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
 				    struct intel_ring_buffer *ring,
-				    struct i915_hw_context *ctx)
+				    struct i915_hw_context *ctx,
+				    struct drm_i915_gem_object *obj)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)_i915_add_request(ring, NULL, file, ctx);
+	(void)_i915_add_request(ring, NULL, file, ctx, obj);
 }
 
 static int
@@ -1078,7 +1079,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx, batch_obj);
 
 err:
 	eb_destroy(eb);
-- 
1.7.9.5


* [PATCH v3 12/16] drm/i915: mark rings which were waiting when hang happened
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (10 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 11/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 13/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

For guilty batchbuffer analysis later on during ring resets,
mark all waiting rings so that we can skip them when
trying to find the true culprit of the gpu hang.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         |    3 ++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index df40bb2..5acc46e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1965,7 +1965,8 @@ void i915_hangcheck_elapsed(unsigned long data)
 				ring->hangcheck.score++;
 
 				/* Kick ring */
-				i915_hangcheck_ring_hung(ring);
+				ring->hangcheck.was_waiting =
+					!i915_hangcheck_ring_hung(ring);
 			} else {
 				ring->hangcheck.score = 0;
 			}
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 503e913..49c71ff 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -40,6 +40,7 @@ struct  intel_hw_status_page {
 struct intel_ring_hangcheck {
 	u32 seqno;
 	int score;
+	bool was_waiting;
 };
 
 struct  intel_ring_buffer {
-- 
1.7.9.5


* [PATCH v3 13/16] drm/i915: find guilty batch buffer on ring resets
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (11 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 12/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

After the hangcheck timer has declared the gpu hung,
the rings are reset. During ring reset, when clearing
the request list, do a post-mortem analysis to find out
the guilty batch buffer.

Select requests for further analysis by inspecting the
completed sequence number, which has been updated into
the HWS page. If a request was completed, it can't be
related to the hang.

For noncompleted requests, mark the batch as guilty if
the ring was not waiting and the ring head was stuck
inside the buffer object or in the flush region right
after the batch. Mark everything else as innocent.

v2: Fixed a typo in commit message (Ville Syrjälä)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c |   88 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 547eaf5..475b6ad 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2103,6 +2103,85 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
 	spin_unlock(&file_priv->mm.lock);
 }
 
+static bool i915_head_inside_object(u32 acthd, struct drm_i915_gem_object *obj)
+{
+	if (acthd >= obj->gtt_offset &&
+	    acthd < obj->gtt_offset + obj->base.size)
+		return true;
+
+	return false;
+}
+
+static bool i915_head_inside_request(u32 acthd, u32 rs, u32 re)
+{
+	if (rs < re) {
+		if (acthd >= rs && acthd < re)
+			return true;
+	} else if (rs > re) {
+		if (acthd >= rs || acthd < re)
+			return true;
+	}
+
+	return false;
+}
+
+static bool i915_request_guilty(struct drm_i915_gem_request *request,
+				const u32 acthd, bool *inside)
+{
+	if (request->batch_obj) {
+		if (i915_head_inside_object(acthd, request->batch_obj)) {
+			*inside = true;
+			return true;
+		}
+	}
+
+	if (i915_head_inside_request(acthd, request->head, request->tail)) {
+		*inside = false;
+		return true;
+	}
+
+	return false;
+}
+
+static void i915_set_reset_status(struct intel_ring_buffer *ring,
+				  struct drm_i915_gem_request *request,
+				  u32 acthd)
+{
+	struct i915_ctx_hang_stats *hs = NULL;
+	bool inside, guilty;
+
+	/* Innocent until proven guilty */
+	guilty = false;
+
+	if (!ring->hangcheck.was_waiting &&
+	    i915_request_guilty(request, acthd, &inside)) {
+		DRM_ERROR("%s hung %s bo (0x%x ctx %d) at 0x%x\n",
+			  ring->name,
+			  inside ? "inside" : "flushing",
+			  request->batch_obj ?
+			  request->batch_obj->gtt_offset : 0,
+			  request->ctx ? request->ctx->id : 0,
+			  acthd);
+
+		guilty = true;
+	}
+
+	/* If contexts are disabled or this is the default context, use
+	 * file_priv->reset_state
+	 */
+	if (request->ctx && request->ctx->id != DEFAULT_CONTEXT_ID)
+		hs = &request->ctx->hang_stats;
+	else if (request->file_priv)
+		hs = &request->file_priv->hang_stats;
+
+	if (hs) {
+		if (guilty)
+			hs->batch_active++;
+		else
+			hs->batch_pending++;
+	}
+}
+
 static void i915_gem_free_request(struct drm_i915_gem_request *request)
 {
 	list_del(&request->list);
@@ -2117,6 +2196,12 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
+	u32 completed_seqno;
+	u32 acthd;
+
+	acthd = intel_ring_get_active_head(ring);
+	completed_seqno = ring->get_seqno(ring, false);
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2124,6 +2209,9 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   struct drm_i915_gem_request,
 					   list);
 
+		if (request->seqno > completed_seqno)
+			i915_set_reset_status(ring, request, acthd);
+
 		i915_gem_free_request(request);
 	}
 
-- 
1.7.9.5
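The interval test in i915_head_inside_request() has to cope with the request wrapping around the end of the ring buffer (rs > re). The predicate can be lifted out of the patch and exercised standalone; the code below is a verbatim copy of its logic under a hypothetical name, with only stock C types:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone copy of i915_head_inside_request() from the patch:
 * checks whether ACTHD lies in [rs, re) on a ring buffer, where
 * the request may wrap past the end of the ring (rs > re). */
static bool head_inside_request(unsigned int acthd,
				unsigned int rs, unsigned int re)
{
	if (rs < re) {
		/* no wrap: a plain half-open interval check */
		if (acthd >= rs && acthd < re)
			return true;
	} else if (rs > re) {
		/* wrapped: valid range is [rs, ring_end) U [0, re) */
		if (acthd >= rs || acthd < re)
			return true;
	}
	/* rs == re means an empty request: never inside */
	return false;
}
```

Note that the empty case (rs == re) deliberately falls through to false, so a zero-length request can never be blamed.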

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (12 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 13/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-11 15:13   ` Mika Kuoppala
  2013-04-16 11:32   ` [PATCH " Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 15/16] drm/i915: add i915_reset_count Mika Kuoppala
                   ` (2 subsequent siblings)
  16 siblings, 2 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

If a context has recently submitted a faulty batchbuffer guilty of
a gpu hang and decides to keep submitting more crap, ban it permanently.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |   23 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h            |    7 +++++++
 drivers/gpu/drm/i915/i915_gem.c            |    7 +++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   13 +++++++++++++
 4 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index a5b8aa9..0928f11 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -852,6 +852,8 @@ int intel_gpu_reset(struct drm_device *dev)
 int i915_reset(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
+	struct i915_ctx_hang_stats *hs;
+	bool do_wedge = true;
 	int ret;
 
 	if (!i915_try_reset)
@@ -859,10 +861,29 @@ int i915_reset(struct drm_device *dev)
 
 	mutex_lock(&dev->struct_mutex);
 
+	/* i915_gem_reset will set this if it finds guilty context */
+	dev_priv->gpu_error.hang_stats = NULL;
+
 	i915_gem_reset(dev);
 
+	hs = dev_priv->gpu_error.hang_stats;
+
+	if (hs) {
+		if (hs->batch_active == 1) {
+			do_wedge = false;
+		} else if (!hs->banned &&
+			   get_seconds() - hs->batch_active_reset_ts < 5) {
+			hs->banned = true;
+			do_wedge = false;
+		}
+
+		hs->batch_active_reset_ts = get_seconds();
+	}
+
+	dev_priv->gpu_error.hang_stats = NULL;
+
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
+	if (do_wedge && get_seconds() - dev_priv->gpu_error.last_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
 	else
 		ret = intel_gpu_reset(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8223908..30ba79c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -458,6 +458,12 @@ struct i915_ctx_hang_stats {
 
 	/* This context had batch active when hang was declared */
 	unsigned batch_active;
+
+	/* Time when this context was last blamed for a GPU reset */
+	unsigned long batch_active_reset_ts;
+
+	/* This context is banned to submit more work */
+	bool banned;
 };
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
@@ -831,6 +837,7 @@ struct i915_gpu_error {
 	struct work_struct work;
 
 	unsigned long last_reset;
+	struct i915_ctx_hang_stats *hang_stats;
 
 	/**
 	 * State variable and reset counter controlling the reset flow
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 475b6ad..ca5c9c3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2147,6 +2147,7 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 				  struct drm_i915_gem_request *request,
 				  u32 acthd)
 {
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	struct i915_ctx_hang_stats *hs = NULL;
 	bool inside, guilty;
 
@@ -2175,10 +2176,12 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 		hs = &request->file_priv->hang_stats;
 
 	if (hs) {
-		if (guilty)
+		if (guilty) {
 			hs->batch_active++;
-		else
+			dev_priv->gpu_error.hang_stats = hs;
+		} else {
 			hs->batch_pending++;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index bd1750a..f1b1ea9 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -844,6 +844,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
 	struct i915_hw_context *ctx;
+	struct i915_ctx_hang_stats *hs;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1026,6 +1027,18 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+	hs = i915_gem_context_get_hang_stats(&dev_priv->ring[RCS],
+					     file, ctx_id);
+	if (IS_ERR(hs)) {
+		ret = PTR_ERR(hs);
+		goto err;
+	}
+
+	if (hs->banned) {
+		ret = -EIO;
+		goto err;
+	}
+
 	ctx = i915_switch_context(ring, file, ctx_id);
 	if (IS_ERR(ctx)) {
 		ret = PTR_ERR(ctx);
-- 
1.7.9.5
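The ban policy added to i915_reset() above can be modeled in userspace. This is a sketch under the patch's stated rules (first offence gets a free pass, a second blame within 5 seconds gets the context banned); field names follow i915_ctx_hang_stats, but the function name and struct are hypothetical stand-ins, not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the ban decision in i915_reset(). */
struct hang_stats {
	unsigned int batch_active;           /* times blamed for a hang */
	unsigned long batch_active_reset_ts; /* last time blamed, in seconds */
	bool banned;
};

/* Returns true if the GPU should be declared wedged instead of
 * being quietly reset on behalf of the guilty context. */
static bool handle_guilty_context(struct hang_stats *hs, unsigned long now)
{
	bool do_wedge = true;

	if (hs->batch_active == 1) {
		do_wedge = false;            /* first offence: just reset */
	} else if (!hs->banned &&
		   now - hs->batch_active_reset_ts < 5) {
		hs->banned = true;           /* repeat offender: ban it */
		do_wedge = false;
	}
	hs->batch_active_reset_ts = now;
	return do_wedge;
}
```

Once banned, the context no longer earns the driver a free reset, so a third hang falls back to the existing "hanging too fast" wedge path.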


* [PATCH v3 15/16] drm/i915: add i915_reset_count
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (13 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-04 15:32 ` [PATCH v3 16/16] drm/i915: add i915_get_reset_stats_ioctl Mika Kuoppala
  2013-04-24 23:27 ` [PATCH v3 00/16] arb robustness enablers v3 Ben Widawsky
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

Use reset_counter to track how many actual resets have
been done. A reset in progress is enough to increment
the counter.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h |    9 +++++++--
 drivers/gpu/drm/i915/i915_irq.c |    2 +-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 30ba79c..e682077 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -867,7 +867,7 @@ struct i915_gpu_error {
 	 * being true.
 	 */
 #define I915_RESET_IN_PROGRESS_FLAG	1
-#define I915_WEDGED			0xffffffff
+#define I915_WEDGED			(1 << 31)
 
 	/**
 	 * Waitqueue to signal when the reset has completed. Used by clients
@@ -1651,7 +1651,12 @@ static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
 
 static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
 {
-	return atomic_read(&error->reset_counter) == I915_WEDGED;
+	return atomic_read(&error->reset_counter) & I915_WEDGED;
+}
+
+static inline u32 i915_reset_count(struct i915_gpu_error *error)
+{
+	return ((atomic_read(&error->reset_counter) & (~I915_WEDGED)) + 1) / 2;
 }
 
 void i915_gem_reset(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 5acc46e..e7358de 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -987,7 +987,7 @@ static void i915_error_work_func(struct work_struct *work)
 			kobject_uevent_env(&dev->primary->kdev.kobj,
 					   KOBJ_CHANGE, reset_done_event);
 		} else {
-			atomic_set(&error->reset_counter, I915_WEDGED);
+			atomic_set_mask(I915_WEDGED, &error->reset_counter);
 		}
 
 		for_each_ring(ring, dev_priv, i)
-- 
1.7.9.5
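The counter encoding relied on by the new i915_reset_count() can be checked in isolation: the counter is odd while a reset is in progress, even once it completes, and bit 31 marks a terminally wedged GPU. A userspace sketch of the same arithmetic:

```c
#include <assert.h>

/* Standalone model of the reset_counter encoding from the patch. */
#define I915_RESET_IN_PROGRESS_FLAG 1
#define I915_WEDGED (1u << 31)

static unsigned int reset_count(unsigned int counter)
{
	/* Mask off the wedged bit, then round up so that an
	 * in-progress reset (odd value) already counts as one. */
	return ((counter & ~I915_WEDGED) + 1) / 2;
}
```

This is why the patch also changes I915_WEDGED from 0xffffffff to a single bit: the low bits must keep counting resets even after the device is wedged.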


* [PATCH v3 16/16] drm/i915: add i915_get_reset_stats_ioctl
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (14 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 15/16] drm/i915: add i915_reset_count Mika Kuoppala
@ 2013-04-04 15:32 ` Mika Kuoppala
  2013-04-24 23:27 ` [PATCH v3 00/16] arb robustness enablers v3 Ben Widawsky
  16 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-04 15:32 UTC (permalink / raw)
  To: intel-gfx

This ioctl returns reset stats for the specified context.

The struct returned contains context loss counters.

reset_count:    all resets across all contexts
batch_active:   active batches lost on resets
batch_pending:  pending batches lost on resets

v2: get rid of state tracking completely and deliver only counts. Idea
    from Chris Wilson.

v3: fix commit message

v4: default context handled inside i915_gem_context_get_hang_stats

v5: reset_count only for privileged process

v6: ctx=0 needs CAP_SYS_ADMIN for batch_* counters (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Ian Romanick <idr@freedesktop.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_dma.c |    1 +
 drivers/gpu/drm/i915/i915_drv.c |   37 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h |    2 ++
 include/uapi/drm/i915_drm.h     |   17 +++++++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 6693c7c..f4d7645 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1898,6 +1898,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
 	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
 	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
+	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATS, i915_get_reset_stats_ioctl, DRM_UNLOCKED),
 };
 
 int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 0928f11..861f396 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1302,3 +1302,40 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 
 	return 0;
 }
+
+int i915_get_reset_stats_ioctl(struct drm_device *dev,
+			       void *data, struct drm_file *file)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_ring_buffer *ring;
+	struct drm_i915_reset_stats *args = data;
+	struct i915_ctx_hang_stats *hs;
+	int ret;
+
+	if (args->ctx_id == 0 && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = mutex_lock_interruptible(&dev->struct_mutex);
+	if (ret)
+		return ret;
+
+	ring = &dev_priv->ring[RCS];
+
+	hs = i915_gem_context_get_hang_stats(ring, file, args->ctx_id);
+	if (IS_ERR_OR_NULL(hs)) {
+		mutex_unlock(&dev->struct_mutex);
+		return hs == NULL ? -ENODEV : PTR_ERR(hs);
+	}
+
+	if (capable(CAP_SYS_ADMIN))
+		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
+	else
+		args->reset_count = 0;
+
+	args->batch_active = hs->batch_active;
+	args->batch_pending = hs->batch_pending;
+
+	mutex_unlock(&dev->struct_mutex);
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e682077..23064f4 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1896,6 +1896,8 @@ extern int intel_enable_rc6(const struct drm_device *dev);
 extern bool i915_semaphore_is_enabled(struct drm_device *dev);
 int i915_reg_read_ioctl(struct drm_device *dev, void *data,
 			struct drm_file *file);
+int i915_get_reset_stats_ioctl(struct drm_device *dev, void *data,
+			       struct drm_file *file);
 
 /* overlay */
 #ifdef CONFIG_DEBUG_FS
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 07d5941..75cafc1 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -198,6 +198,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_I915_GEM_SET_CACHING	0x2f
 #define DRM_I915_GEM_GET_CACHING	0x30
 #define DRM_I915_REG_READ		0x31
+#define DRM_I915_GET_RESET_STATS	0x32
 
 #define DRM_IOCTL_I915_INIT		DRM_IOW( DRM_COMMAND_BASE + DRM_I915_INIT, drm_i915_init_t)
 #define DRM_IOCTL_I915_FLUSH		DRM_IO ( DRM_COMMAND_BASE + DRM_I915_FLUSH)
@@ -247,6 +248,7 @@ typedef struct _drm_i915_sarea {
 #define DRM_IOCTL_I915_GEM_CONTEXT_CREATE	DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_CREATE, struct drm_i915_gem_context_create)
 #define DRM_IOCTL_I915_GEM_CONTEXT_DESTROY	DRM_IOW (DRM_COMMAND_BASE + DRM_I915_GEM_CONTEXT_DESTROY, struct drm_i915_gem_context_destroy)
 #define DRM_IOCTL_I915_REG_READ			DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_REG_READ, struct drm_i915_reg_read)
+#define DRM_IOCTL_I915_GET_RESET_STATS		DRM_IOWR (DRM_COMMAND_BASE + DRM_I915_GET_RESET_STATS, struct drm_i915_reset_stats)
 
 /* Allow drivers to submit batchbuffers directly to hardware, relying
  * on the security mechanisms provided by hardware.
@@ -980,4 +982,19 @@ struct drm_i915_reg_read {
 	__u64 offset;
 	__u64 val; /* Return value */
 };
+
+struct drm_i915_reset_stats {
+	__u32 ctx_id;
+	__u32 flags;
+
+	/* For all contexts */
+	__u32 reset_count;
+
+	/* For this context */
+	__u32 batch_active;
+	__u32 batch_pending;
+
+	__u32 pad;
+};
+
 #endif /* _UAPI_I915_DRM_H_ */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts
  2013-04-04 15:32 ` [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
@ 2013-04-09 22:18   ` Chris Wilson
  2013-04-10  0:12     ` [PATCH] " Ben Widawsky
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Wilson @ 2013-04-09 22:18 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Thu, Apr 04, 2013 at 06:32:35PM +0300, Mika Kuoppala wrote:
> In preparation to do analysis of which context was
> guilty of a gpu hang, store a kreffed context pointer
> into the request struct.
> 
> This allows us to inspect contexts when the gpu is reset
> even if those contexts have already been released
> by userspace.
> 
> v2: track i915_hw_context pointers instead of using ctx_ids
>     (from Chris Wilson)
> 
> v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
>     (recommended by Chris)
> 
> v4: kref_* put inside static inlines (Daniel Vetter)
>     remove code duplication on freeing context (Chris Wilson)
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v3)
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v4)
> ---
> @@ -511,7 +520,7 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>  		return -ENOENT;
>  	}
>  
> -	do_destroy(ctx);
> +	i915_gem_context_unreference(ctx);

This needs to remove the ctx from the idr as well as dropping the
reference associated with the id.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] drm/i915: reference count for i915_hw_contexts
  2013-04-09 22:18   ` Chris Wilson
@ 2013-04-10  0:12     ` Ben Widawsky
  2013-04-20 18:11       ` Ben Widawsky
  0 siblings, 1 reply; 30+ messages in thread
From: Ben Widawsky @ 2013-04-10  0:12 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ben Widawsky, Mika Kuoppala

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

In preparation to do analysis of which context was
guilty of a gpu hang, store a kreffed context pointer
into the request struct.

This allows us to inspect contexts when the gpu is reset
even if those contexts have already been released
by userspace.

v2: track i915_hw_context pointers instead of using ctx_ids
    (from Chris Wilson)

v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
    (recommended by Chris)

v4: kref_* put inside static inlines (Daniel Vetter)
    remove code duplication on freeing context (Chris Wilson)

v5: idr_remove and ctx->file_priv = NULL in destroy ioctl (Chris)
This actually will cause a problem if one destroys a context and later
refers to the id of the context (multiple contexts may have the same
id, but only 1 will exist in the idr).

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v2)
Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v3)
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v4)
Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v5)
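
The scheme this patch adopts — kref_init on creation, kref_get per request, kref_put with a release callback, wrapped in static inlines — is the standard kernel kref pattern. A minimal userspace model in plain C (the kref stand-in and the context/free names are hypothetical; the kernel's kref is atomic and the real free path also does the idr_remove shown in the diff):

```c
#include <assert.h>

/* Userspace stand-in for struct kref; the kernel's is an atomic count. */
struct kref { int refcount; };
static void kref_init(struct kref *k) { k->refcount = 1; }
static void kref_get(struct kref *k) { k->refcount++; }
static void kref_put(struct kref *k, void (*release)(struct kref *))
{
	if (--k->refcount == 0)
		release(k);
}

/* Hypothetical context with the embedded kref, as in i915_hw_context. */
struct hw_context { struct kref ref; };

static int freed;
static void context_free(struct kref *ref) { freed = 1; }

/* The static inline wrappers the patch adds around kref_get/kref_put. */
static void context_reference(struct hw_context *ctx)
{
	kref_get(&ctx->ref);
}
static void context_unreference(struct hw_context *ctx)
{
	kref_put(&ctx->ref, context_free);
}
```

The point of the pattern for this series: a request holding its own reference keeps the context alive for post-hang analysis even after userspace's destroy ioctl has dropped the creation reference.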
---
 drivers/gpu/drm/i915/i915_drv.h            | 20 ++++++++++++++++++--
 drivers/gpu/drm/i915/i915_gem.c            | 27 ++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_context.c    | 21 ++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  7 ++++---
 4 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 00a0da8..156c5a0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -456,6 +456,7 @@ struct i915_hw_ppgtt {
 /* This must match up with the value previously used for execbuf2.rsvd1. */
 #define DEFAULT_CONTEXT_ID 0
 struct i915_hw_context {
+	struct kref ref;
 	int id;
 	bool is_initialized;
 	struct drm_i915_file_private *file_priv;
@@ -1261,6 +1262,9 @@ struct drm_i915_gem_request {
 	/** Postion in the ringbuffer of the end of the request */
 	u32 tail;
 
+	/** Context related to this request */
+	struct i915_hw_context *ctx;
+
 	/** Time at which this request was emitted, in jiffies. */
 	unsigned long emitted_jiffies;
 
@@ -1648,9 +1652,10 @@ int __must_check i915_gpu_idle(struct drm_device *dev);
 int __must_check i915_gem_idle(struct drm_device *dev);
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *seqno,
-		      struct drm_file *file);
+		      struct drm_file *file,
+		      struct i915_hw_context *ctx);
 #define i915_add_request(ring, seqno) \
-	_i915_add_request(ring, seqno, NULL)
+	_i915_add_request(ring, seqno, NULL, NULL)
 int __must_check i915_wait_seqno(struct intel_ring_buffer *ring,
 				 uint32_t seqno);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -1694,6 +1699,17 @@ void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
 struct i915_hw_context * __must_check
 i915_switch_context(struct intel_ring_buffer *ring,
 		    struct drm_file *file, int to_id);
+void i915_gem_context_free(struct kref *ctx_ref);
+static inline void i915_gem_context_reference(struct i915_hw_context *ctx)
+{
+	kref_get(&ctx->ref);
+}
+
+static inline void i915_gem_context_unreference(struct i915_hw_context *ctx)
+{
+	kref_put(&ctx->ref, i915_gem_context_free);
+}
+
 int i915_gem_context_create_ioctl(struct drm_device *dev, void *data,
 				  struct drm_file *file);
 int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4e8d7b2..98b04db 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2001,7 +2001,8 @@ i915_gem_get_seqno(struct drm_device *dev, u32 *seqno)
 
 int _i915_add_request(struct intel_ring_buffer *ring,
 		      u32 *out_seqno,
-		      struct drm_file *file)
+		      struct drm_file *file,
+		      struct i915_hw_context *ctx)
 {
 	drm_i915_private_t *dev_priv = ring->dev->dev_private;
 	struct drm_i915_gem_request *request;
@@ -2041,6 +2042,11 @@ int _i915_add_request(struct intel_ring_buffer *ring,
 	request->seqno = intel_ring_get_seqno(ring);
 	request->ring = ring;
 	request->tail = request_ring_position;
+	request->ctx = ctx;
+
+	if (request->ctx)
+		i915_gem_context_reference(ctx);
+
 	request->emitted_jiffies = jiffies;
 	was_empty = list_empty(&ring->request_list);
 	list_add_tail(&request->list, &ring->request_list);
@@ -2093,6 +2099,17 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
 	spin_unlock(&file_priv->mm.lock);
 }
 
+static void i915_gem_free_request(struct drm_i915_gem_request *request)
+{
+	list_del(&request->list);
+	i915_gem_request_remove_from_client(request);
+
+	if (request->ctx)
+		i915_gem_context_unreference(request->ctx);
+
+	kfree(request);
+}
+
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
@@ -2103,9 +2120,7 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   struct drm_i915_gem_request,
 					   list);
 
-		list_del(&request->list);
-		i915_gem_request_remove_from_client(request);
-		kfree(request);
+		i915_gem_free_request(request);
 	}
 
 	while (!list_empty(&ring->active_list)) {
@@ -2197,9 +2212,7 @@ i915_gem_retire_requests_ring(struct intel_ring_buffer *ring)
 		 */
 		ring->last_retired_head = request->tail;
 
-		list_del(&request->list);
-		i915_gem_request_remove_from_client(request);
-		kfree(request);
+		i915_gem_free_request(request);
 	}
 
 	/* Move any buffers on the active list that are no longer referenced
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 7f36248..747a750 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -126,11 +126,19 @@ static int get_context_size(struct drm_device *dev)
 
 static void do_destroy(struct i915_hw_context *ctx)
 {
+	drm_gem_object_unreference(&ctx->obj->base);
+	kfree(ctx);
+}
+
+void i915_gem_context_free(struct kref *ctx_ref)
+{
+	struct i915_hw_context *ctx = container_of(ctx_ref,
+						   typeof(*ctx), ref);
+
 	if (ctx->file_priv)
 		idr_remove(&ctx->file_priv->context_idr, ctx->id);
 
-	drm_gem_object_unreference(&ctx->obj->base);
-	kfree(ctx);
+	do_destroy(ctx);
 }
 
 static struct i915_hw_context *
@@ -145,6 +153,7 @@ create_hw_context(struct drm_device *dev,
 	if (ctx == NULL)
 		return ERR_PTR(-ENOMEM);
 
+	kref_init(&ctx->ref);
 	ctx->obj = i915_gem_alloc_object(dev, dev_priv->hw_context_size);
 	if (ctx->obj == NULL) {
 		kfree(ctx);
@@ -282,8 +291,8 @@ static int context_idr_cleanup(int id, void *p, void *data)
 
 	BUG_ON(id == DEFAULT_CONTEXT_ID);
 
-	do_destroy(ctx);
-
+	ctx->file_priv = NULL;
+	i915_gem_context_unreference(ctx);
 	return 0;
 }
 
@@ -518,7 +527,9 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
-	do_destroy(ctx);
+	idr_remove(&ctx->file_priv->context_idr, ctx->id);
+	ctx->file_priv = NULL;
+	i915_gem_context_unreference(ctx);
 
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 6624937..dd278ed 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -796,13 +796,14 @@ i915_gem_execbuffer_move_to_active(struct list_head *objects,
 static void
 i915_gem_execbuffer_retire_commands(struct drm_device *dev,
 				    struct drm_file *file,
-				    struct intel_ring_buffer *ring)
+				    struct intel_ring_buffer *ring,
+				    struct i915_hw_context *ctx)
 {
 	/* Unconditionally force add_request to emit a full flush. */
 	ring->gpu_caches_dirty = true;
 
 	/* Add a breadcrumb for the completion of the batch buffer */
-	(void)_i915_add_request(ring, NULL, file);
+	(void)_i915_add_request(ring, NULL, file, ctx);
 }
 
 static int
@@ -1077,7 +1078,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	trace_i915_gem_ring_dispatch(ring, intel_ring_get_seqno(ring), flags);
 
 	i915_gem_execbuffer_move_to_active(&eb->objects, ring);
-	i915_gem_execbuffer_retire_commands(dev, file, ring);
+	i915_gem_execbuffer_retire_commands(dev, file, ring, ctx);
 
 err:
 	eb_destroy(eb);
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-04-04 15:32 ` [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
@ 2013-04-11 15:13   ` Mika Kuoppala
  2013-04-16 11:32   ` [PATCH " Mika Kuoppala
  1 sibling, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-11 15:13 UTC (permalink / raw)
  To: intel-gfx

Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:

> If a context has recently submitted faulty batchbuffers guilty of
> a gpu hang and decides to keep submitting more crap, ban it permanently.
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c            |   23 ++++++++++++++++++++++-
>  drivers/gpu/drm/i915/i915_drv.h            |    7 +++++++
>  drivers/gpu/drm/i915/i915_gem.c            |    7 +++++--
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |   13 +++++++++++++
>  4 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index a5b8aa9..0928f11 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -852,6 +852,8 @@ int intel_gpu_reset(struct drm_device *dev)
>  int i915_reset(struct drm_device *dev)
>  {
>  	drm_i915_private_t *dev_priv = dev->dev_private;
> +	struct i915_ctx_hang_stats *hs;
> +	bool do_wedge = true;
>  	int ret;
>  
>  	if (!i915_try_reset)
> @@ -859,10 +861,29 @@ int i915_reset(struct drm_device *dev)
>  
>  	mutex_lock(&dev->struct_mutex);
>  
> +	/* i915_gem_reset will set this if it finds guilty context */
> +	dev_priv->gpu_error.hang_stats = NULL;
> +
>  	i915_gem_reset(dev);
>  
> +	hs = dev_priv->gpu_error.hang_stats;
> +
> +	if (hs) {
> +		if (hs->batch_active == 1) {
> +			do_wedge = false;
> +		} else if (!hs->banned &&
> +			   get_seconds() - hs->batch_active_reset_ts < 5) {
> +			hs->banned = true;
> +			do_wedge = false;
> +		}
> +
> +		hs->batch_active_reset_ts = get_seconds();
> +	}
> +
> +	dev_priv->gpu_error.hang_stats = NULL;
> +
>  	ret = -ENODEV;
> -	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
> +	if (do_wedge && get_seconds() - dev_priv->gpu_error.last_reset < 5)
>  		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
>  	else
>  		ret = intel_gpu_reset(dev);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 8223908..30ba79c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -458,6 +458,12 @@ struct i915_ctx_hang_stats {
>  
>  	/* This context had batch active when hang was declared */
>  	unsigned batch_active;
> +
> +	/* Time when this context was last blamed for a GPU reset */
> +	unsigned long batch_active_reset_ts;
> +
> +	/* This context is banned to submit more work */
> +	bool banned;
>  };
>  
>  /* This must match up with the value previously used for execbuf2.rsvd1. */
> @@ -831,6 +837,7 @@ struct i915_gpu_error {
>  	struct work_struct work;
>  
>  	unsigned long last_reset;
> +	struct i915_ctx_hang_stats *hang_stats;
>  
>  	/**
>  	 * State variable and reset counter controlling the reset flow
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 475b6ad..ca5c9c3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2147,6 +2147,7 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
>  				  struct drm_i915_gem_request *request,
>  				  u32 acthd)
>  {
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>  	struct i915_ctx_hang_stats *hs = NULL;
>  	bool inside, guilty;
>  
> @@ -2175,10 +2176,12 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
>  		hs = &request->file_priv->hang_stats;
>  
>  	if (hs) {
> -		if (guilty)
> +		if (guilty) {
>  			hs->batch_active++;
> -		else
> +			dev_priv->gpu_error.hang_stats = hs;

This is wrong. The context can be destroyed shortly after,
before the reset is handled, making this a dangling pointer.


> +		} else {
>  			hs->batch_pending++;
> +		}
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index bd1750a..f1b1ea9 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -844,6 +844,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  	struct drm_clip_rect *cliprects = NULL;
>  	struct intel_ring_buffer *ring;
>  	struct i915_hw_context *ctx;
> +	struct i915_ctx_hang_stats *hs;
>  	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
>  	u32 exec_start, exec_len;
>  	u32 mask, flags;
> @@ -1026,6 +1027,18 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  	if (ret)
>  		goto err;
>  
> +	hs = i915_gem_context_get_hang_stats(&dev_priv->ring[RCS],
> +					     file, ctx_id);
> +	if (IS_ERR(hs)) {
> +		ret = PTR_ERR(hs);
> +		goto err;
> +	}
> +
> +	if (hs->banned) {
> +		ret = -EIO;
> +		goto err;
> +	}
> +
>  	ctx = i915_switch_context(ring, file, ctx_id);
>  	if (IS_ERR(ctx)) {
>  		ret = PTR_ERR(ctx);
> -- 
> 1.7.9.5

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 14/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-04-04 15:32 ` [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
  2013-04-11 15:13   ` Mika Kuoppala
@ 2013-04-16 11:32   ` Mika Kuoppala
  2013-04-16 13:59     ` Chris Wilson
  1 sibling, 1 reply; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-16 11:32 UTC (permalink / raw)
  To: intel-gfx

If a context has recently submitted faulty batchbuffers guilty of
a gpu hang and decides to keep submitting more crap, ban it permanently.

v2: Store guilty ban status bool in gpu_error instead of pointers
    that might become dangling before hang is declared.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            |    6 +++++-
 drivers/gpu/drm/i915/i915_drv.h            |    9 +++++++++
 drivers/gpu/drm/i915/i915_gem.c            |   14 ++++++++++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   13 +++++++++++++
 4 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index bddb9a5..ae689b4 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -885,10 +885,14 @@ int i915_reset(struct drm_device *dev)
 
 	mutex_lock(&dev->struct_mutex);
 
+	/* i915_gem_reset() will set this */
+	dev_priv->gpu_error.ctx_banned = false;
+
 	i915_gem_reset(dev);
 
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
+	if (!dev_priv->gpu_error.ctx_banned &&
+	    get_seconds() - dev_priv->gpu_error.last_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
 	else
 		ret = intel_gpu_reset(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3b393ed..1945224 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -459,6 +459,12 @@ struct i915_ctx_hang_stats {
 
 	/* This context had batch active when hang was declared */
 	unsigned batch_active;
+
+	/* Time when this context was last blamed for a GPU reset */
+	unsigned long batch_active_reset_ts;
+
+	/* This context is banned to submit more work */
+	bool banned;
 };
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
@@ -835,6 +841,9 @@ struct i915_gpu_error {
 
 	unsigned long last_reset;
 
+	/* During reset handling, guilty context found and banned */
+	bool ctx_banned;
+
 	/**
 	 * State variable and reset counter controlling the reset flow
 	 *
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 0e87765..134528a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2147,6 +2147,7 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 				  struct drm_i915_gem_request *request,
 				  u32 acthd)
 {
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	struct i915_ctx_hang_stats *hs = NULL;
 	bool inside, guilty;
 
@@ -2175,10 +2176,19 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 		hs = &request->file_priv->hang_stats;
 
 	if (hs) {
-		if (guilty)
+		if (guilty) {
+			if (!hs->banned &&
+			    get_seconds() - hs->batch_active_reset_ts < 5) {
+				hs->banned = true;
+				DRM_ERROR("context hanging too fast, "
+					  "declaring banned\n");
+				dev_priv->gpu_error.ctx_banned = true;
+			}
 			hs->batch_active++;
-		else
+			hs->batch_active_reset_ts = get_seconds();
+		} else {
 			hs->batch_pending++;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index bd1750a..f1b1ea9 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -844,6 +844,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
 	struct i915_hw_context *ctx;
+	struct i915_ctx_hang_stats *hs;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1026,6 +1027,18 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+	hs = i915_gem_context_get_hang_stats(&dev_priv->ring[RCS],
+					     file, ctx_id);
+	if (IS_ERR(hs)) {
+		ret = PTR_ERR(hs);
+		goto err;
+	}
+
+	if (hs->banned) {
+		ret = -EIO;
+		goto err;
+	}
+
 	ctx = i915_switch_context(ring, file, ctx_id);
 	if (IS_ERR(ctx)) {
 		ret = PTR_ERR(ctx);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 14/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-04-16 11:32   ` [PATCH " Mika Kuoppala
@ 2013-04-16 13:59     ` Chris Wilson
  2013-04-17 10:11       ` [PATCH v3 " Mika Kuoppala
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Wilson @ 2013-04-16 13:59 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Apr 16, 2013 at 02:32:13PM +0300, Mika Kuoppala wrote:
> If a context has recently submitted faulty batchbuffers guilty of
> a gpu hang and decides to keep submitting more crap, ban it permanently.
> 
> v2: Store guilty ban status bool in gpu_error instead of pointers
>     that might become dangling before hang is declared.

Meh. Stashing away a return parameter inside gpu_error, not pretty.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context
  2013-04-16 13:59     ` Chris Wilson
@ 2013-04-17 10:11       ` Mika Kuoppala
  0 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-17 10:11 UTC (permalink / raw)
  To: intel-gfx

If a context has recently submitted faulty batchbuffers guilty of
a gpu hang and decides to keep submitting more crap, ban it permanently.

v2: Store guilty ban status bool in gpu_error instead of pointers
    that might become dangling before hang is declared.

v3: Use return value for banned status instead of stashing state
    into gpu_error (Chris Wilson)

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
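
The ban heuristic this version implements in i915_set_reset_status() — a context is banned only if it is found guilty again while its previous guilty reset is less than five seconds old — can be modelled in isolation. A sketch with hypothetical names (`now` stands in for the kernel's get_seconds(), `note_guilty_hang` for the guilty branch of i915_set_reset_status):

```c
#include <assert.h>
#include <stdbool.h>

#define BAN_PERIOD_SECS 5

/* Per-context bookkeeping, mirroring struct i915_ctx_hang_stats. */
struct hang_stats {
	unsigned batch_active;         /* guilty hangs so far */
	unsigned long active_reset_ts; /* time of last guilty hang */
	bool banned;
};

/*
 * Called once per guilty hang, before the counters are bumped, exactly
 * as in the patch. Returns true if this hang newly banned the context,
 * which is what i915_gem_reset() ORs together per ring so i915_reset()
 * can skip the "hanging too fast" wedge.
 */
static bool note_guilty_hang(struct hang_stats *hs, unsigned long now)
{
	bool newly_banned = false;

	if (!hs->banned && now - hs->active_reset_ts < BAN_PERIOD_SECS) {
		hs->banned = true;
		newly_banned = true;
	}
	hs->batch_active++;
	hs->active_reset_ts = now;
	return newly_banned;
}
```

Once banned, the execbuffer path below rejects the context's submissions with -EIO, so the ban is permanent for the lifetime of the context.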
---
 drivers/gpu/drm/i915/i915_drv.c            |    5 ++--
 drivers/gpu/drm/i915/i915_drv.h            |    8 ++++++-
 drivers/gpu/drm/i915/i915_gem.c            |   35 +++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   13 +++++++++++
 4 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 9ebe895..fe6f8d2 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -883,16 +883,17 @@ int i915_reset(struct drm_device *dev)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	int ret;
+	bool ctx_banned;
 
 	if (!i915_try_reset)
 		return 0;
 
 	mutex_lock(&dev->struct_mutex);
 
-	i915_gem_reset(dev);
+	ctx_banned = i915_gem_reset(dev);
 
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->gpu_error.last_reset < 5)
+	if (!ctx_banned && get_seconds() - dev_priv->gpu_error.last_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
 	else
 		ret = intel_gpu_reset(dev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d7fd16b..ecd979e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -459,6 +459,12 @@ struct i915_ctx_hang_stats {
 
 	/* This context had batch active when hang was declared */
 	unsigned batch_active;
+
+	/* Time when this context was last blamed for a GPU reset */
+	unsigned long batch_active_reset_ts;
+
+	/* This context is banned to submit more work */
+	bool banned;
 };
 
 /* This must match up with the value previously used for execbuf2.rsvd1. */
@@ -1662,7 +1668,7 @@ static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
 	return atomic_read(&error->reset_counter) == I915_WEDGED;
 }
 
-void i915_gem_reset(struct drm_device *dev);
+bool i915_gem_reset(struct drm_device *dev);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj);
 int __must_check i915_gem_object_set_domain(struct drm_i915_gem_object *obj,
 					    uint32_t read_domains,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 0e87765..646bb7e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2143,15 +2143,15 @@ static bool i915_request_guilty(struct drm_i915_gem_request *request,
 	return false;
 }
 
-static void i915_set_reset_status(struct intel_ring_buffer *ring,
+static bool i915_set_reset_status(struct intel_ring_buffer *ring,
 				  struct drm_i915_gem_request *request,
 				  u32 acthd)
 {
 	struct i915_ctx_hang_stats *hs = NULL;
-	bool inside, guilty;
+	bool inside, guilty, banned;
 
 	/* Innocent until proven guilty */
-	guilty = false;
+	guilty = banned = false;
 
 	if (!ring->hangcheck.was_waiting &&
 	    i915_request_guilty(request, acthd, &inside)) {
@@ -2175,11 +2175,21 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring,
 		hs = &request->file_priv->hang_stats;
 
 	if (hs) {
-		if (guilty)
+		if (guilty) {
+			if (!hs->banned &&
+			    get_seconds() - hs->batch_active_reset_ts < 5) {
+				hs->banned = banned = true;
+				DRM_ERROR("context hanging too fast, "
+					  "declaring banned\n");
+			}
 			hs->batch_active++;
-		else
+			hs->batch_active_reset_ts = get_seconds();
+		} else {
 			hs->batch_pending++;
+		}
 	}
+
+	return banned;
 }
 
 static void i915_gem_free_request(struct drm_i915_gem_request *request)
@@ -2193,11 +2203,12 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
 	kfree(request);
 }
 
-static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
+static bool i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 				      struct intel_ring_buffer *ring)
 {
 	u32 completed_seqno;
 	u32 acthd;
+	bool ctx_banned = false;
 
 	acthd = intel_ring_get_active_head(ring);
 	completed_seqno = ring->get_seqno(ring, false);
@@ -2210,7 +2221,8 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 					   list);
 
 		if (request->seqno > completed_seqno)
-			i915_set_reset_status(ring, request, acthd);
+			ctx_banned |= i915_set_reset_status(ring,
+							    request, acthd);
 
 		i915_gem_free_request(request);
 	}
@@ -2224,6 +2236,8 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		i915_gem_object_move_to_inactive(obj);
 	}
+
+	return ctx_banned;
 }
 
 static void i915_gem_reset_fences(struct drm_device *dev)
@@ -2247,15 +2261,16 @@ static void i915_gem_reset_fences(struct drm_device *dev)
 	INIT_LIST_HEAD(&dev_priv->mm.fence_list);
 }
 
-void i915_gem_reset(struct drm_device *dev)
+bool i915_gem_reset(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj;
 	struct intel_ring_buffer *ring;
 	int i;
+	bool ctx_banned = false;
 
 	for_each_ring(ring, dev_priv, i)
-		i915_gem_reset_ring_lists(dev_priv, ring);
+		ctx_banned |= i915_gem_reset_ring_lists(dev_priv, ring);
 
 	/* Move everything out of the GPU domains to ensure we do any
 	 * necessary invalidation upon reuse.
@@ -2269,6 +2284,8 @@ void i915_gem_reset(struct drm_device *dev)
 
 	/* The fence registers are invalidated so clear them out */
 	i915_gem_reset_fences(dev);
+
+	return ctx_banned;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index bd1750a..f1b1ea9 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -844,6 +844,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	struct drm_clip_rect *cliprects = NULL;
 	struct intel_ring_buffer *ring;
 	struct i915_hw_context *ctx;
+	struct i915_ctx_hang_stats *hs;
 	u32 ctx_id = i915_execbuffer2_get_context_id(*args);
 	u32 exec_start, exec_len;
 	u32 mask, flags;
@@ -1026,6 +1027,18 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+	hs = i915_gem_context_get_hang_stats(&dev_priv->ring[RCS],
+					     file, ctx_id);
+	if (IS_ERR(hs)) {
+		ret = PTR_ERR(hs);
+		goto err;
+	}
+
+	if (hs->banned) {
+		ret = -EIO;
+		goto err;
+	}
+
 	ctx = i915_switch_context(ring, file, ctx_id);
 	if (IS_ERR(ctx)) {
 		ret = PTR_ERR(ctx);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 01/16] drm/i915: return context from i915_switch_context()
  2013-04-04 15:32 ` [PATCH v3 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
@ 2013-04-20 18:10   ` Ben Widawsky
  0 siblings, 0 replies; 30+ messages in thread
From: Ben Widawsky @ 2013-04-20 18:10 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Thu, Apr 04, 2013 at 06:32:33PM +0300, Mika Kuoppala wrote:
> In preparation for the next commit, return context that
> was switched to from i915_switch_context().
> 
> v2: context in return value instead of param. (Ben Widawsky)
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

I don't see this used in the next commit. While it's true that I
initially wrote the patch this way, I think having functions return
more than just an int is best avoided whenever possible.

Furthermore, I am immediately going to change the ring->last_context_obj
to ring->last_context, which should nullify the need to return the last
context anyway. I already have that patch, and can submit it if needed.

[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] drm/i915: reference count for i915_hw_contexts
  2013-04-10  0:12     ` [PATCH] " Ben Widawsky
@ 2013-04-20 18:11       ` Ben Widawsky
  0 siblings, 0 replies; 30+ messages in thread
From: Ben Widawsky @ 2013-04-20 18:11 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

On Tue, Apr 09, 2013 at 05:12:22PM -0700, Ben Widawsky wrote:
> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> In preparation to do analysis of which context was
> guilty of a gpu hang, store a kreffed context pointer
> into the request struct.
> 
> This allows us to inspect contexts when gpu is reset
> even if those contexts would already be released
> by userspace.
> 
> v2: track i915_hw_context pointers instead of using ctx_ids
>     (from Chris Wilson)
> 
> v3 (Ben): Get rid of do_release() and handle refcounting more compactly.
>     (recommended by Chris)
> 
> v4: kref_* put inside static inlines (Daniel Vetter)
>     remove code duplication on freeing context (Chris Wilson)
> 
> v5: idr_remove and ctx->file_priv = NULL in destroy ioctl (Chris)
> This actually will cause a problem if one destroys a context and later
> refers to the id of the context (multiple contexts may have the same
> id, but only 1 will exist in the idr).
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v2)
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v3)
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v4)
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net> (v5)

I've tested this pretty darn thoroughly by now. I'm not sure if I can
add an r-b to something I've already signed off on, but this is also:
Tested-by: Ben Widawsky <ben@bwidawsk.net>

[snip]

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 02/16] drm/i915: cleanup i915_add_request
  2013-04-04 15:32 ` [PATCH v3 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
@ 2013-04-20 18:36   ` Ben Widawsky
  0 siblings, 0 replies; 30+ messages in thread
From: Ben Widawsky @ 2013-04-20 18:36 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Thu, Apr 04, 2013 at 06:32:34PM +0300, Mika Kuoppala wrote:
> Only execbuffer needs all the parameters. Cleanup everything
> else behind macro.
> 
> v2: _i915_add_request as function name (Chris Wilson)
> 

Wouldn't the convention be __i915_add_request?

> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

In terms of cleanups however, I would like to put the return parameter
(seqno) last. A later patch in the series makes that even more desirable
IMO.

I wouldn't bother with this patch, personally.

Anyway, because I can't find anything functionally incorrect, an
unenthusiastic:
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

If you fix up the argument list ordering, a somewhat enthusiastic:
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 05/16] drm/i915: track ring progression using seqnos
  2013-04-04 15:32 ` [PATCH v3 05/16] drm/i915: track ring progression using seqnos Mika Kuoppala
@ 2013-04-20 18:43   ` Ben Widawsky
  2013-04-21 21:07     ` Ben Widawsky
  0 siblings, 1 reply; 30+ messages in thread
From: Ben Widawsky @ 2013-04-20 18:43 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
> Instead of relying on acthd, track ring seqno progression
> to detect if ring has hung.
> 
> v2: put hangcheck stuff inside struct (Chris Wilson)
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

This patch really scares me. I think we don't want to ever stop using
ACTHD. AFAICT, ACTHD is always something we want to track with regard
to the GPU not making progress (unless we're worried about cycles in
the CS, but there, seqno would equally not help us track progress).

The two danger cases are:
an infinite loop in a shader
never called MI_BATCH_BUFFER_END

And for both, *I think* ACTHD won't change.

I believe what the patch really wants is to stop using instdone, but
continue to use ACTHD as a fallback.

I think if you can keep ACTHD around, I'd be willing to r-b this.

Discussion moved to IRC.


[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 05/16] drm/i915: track ring progression using seqnos
  2013-04-20 18:43   ` Ben Widawsky
@ 2013-04-21 21:07     ` Ben Widawsky
  2013-04-22 13:36       ` Mika Kuoppala
  0 siblings, 1 reply; 30+ messages in thread
From: Ben Widawsky @ 2013-04-21 21:07 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Sat, Apr 20, 2013 at 11:43:51AM -0700, Ben Widawsky wrote:
> On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
> > Instead of relying on acthd, track ring seqno progression
> > to detect if ring has hung.
> > 
> > v2: put hangcheck stuff inside struct (Chris Wilson)
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> 
> This patch really scares me. I think we don't want to ever stop using
> ACTHD. AFAICT, ACTHD is aways something we want to track in regards to
> the GPU not making progress. (unless we're worried about cycles in the
> CS, but there, seqno would equally not help us track progress).
> 
> The two danger cases are:
> an infinite loop in a shader
> never called MI_BATCH_BUFFER_END
> 
> And for both, *I think* ACTHD won't change.

I'm wrong. I was confusing HEAD with ACTHD. ACTHD is supposed to show
the address of either the batch, or the ring. HEAD is something we could
use instead though. Just a thought.

> 
> I believe what the patch really wants is to stop using instdone, but
> continue to use ACTHD as a fallback.
> 
> I think if you can keep ACTHD around, I'd be willing to r-b this.
> 
> Discussion moved to IRC.
> 
> 
> [snip]
> -- 
> Ben Widawsky, Intel Open Source Technology Center
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 05/16] drm/i915: track ring progression using seqnos
  2013-04-21 21:07     ` Ben Widawsky
@ 2013-04-22 13:36       ` Mika Kuoppala
  0 siblings, 0 replies; 30+ messages in thread
From: Mika Kuoppala @ 2013-04-22 13:36 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx

Ben Widawsky <ben@bwidawsk.net> writes:

> On Sat, Apr 20, 2013 at 11:43:51AM -0700, Ben Widawsky wrote:
>> On Thu, Apr 04, 2013 at 06:32:37PM +0300, Mika Kuoppala wrote:
>> > Instead of relying on acthd, track ring seqno progression
>> > to detect if ring has hung.
>> > 
>> > v2: put hangcheck stuff inside struct (Chris Wilson)
>> > 
>> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> 
>> This patch really scares me. I think we don't want to ever stop using
>> ACTHD. AFAICT, ACTHD is aways something we want to track in regards to
>> the GPU not making progress. (unless we're worried about cycles in the
>> CS, but there, seqno would equally not help us track progress).
>> 
>> The two danger cases are:
>> an infinite loop in a shader
>> never called MI_BATCH_BUFFER_END
>> 
>> And for both, *I think* ACTHD won't change.
>
> I'm wrong. I was confusing HEAD with ACTHD. ACTHD is supposed to show
> the address of either the batch, or the ring. HEAD is something we could
> use instead though. Just a thought.

Yes, I confirmed this with my test case. ACTHD will change inside the
ring and also inside the batchbuffer.

The reason I chose seqnos is that they are the least hw (gen)
dependent way to measure whether progress is being made,
and thus the closest to what userspace sees as progress.
If a batch doesn't get retired for whatever reason, a reset will happen.

This way we can detect hangs, looped batchbuffers and infinite
(or, more precisely, too long) loops inside shaders.

I will need to update the commit message to reflect
the fact that with this patch, the batch needs to complete
in a 3 second time window.

-Mika


>> 
>> I believe what the patch really wants is to stop using instdone, but
>> continue to use ACTHD as a fallback.
>> 
>> I think if you can keep ACTHD around, I'd be willing to r-b this.
>> 
>> Discussion moved to IRC.
>> 
>> 
>> [snip]
>> -- 
>> Ben Widawsky, Intel Open Source Technology Center
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
> -- 
> Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 00/16] arb robustness enablers v3
  2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
                   ` (15 preceding siblings ...)
  2013-04-04 15:32 ` [PATCH v3 16/16] drm/i915: add i915_get_reset_stats_ioctl Mika Kuoppala
@ 2013-04-24 23:27 ` Ben Widawsky
  16 siblings, 0 replies; 30+ messages in thread
From: Ben Widawsky @ 2013-04-24 23:27 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Thu, Apr 04, 2013 at 06:32:32PM +0300, Mika Kuoppala wrote:
> Hi,
> 
> I have addressed all the feedback I got from v2 series.
> 
> Most notably the interface has changed.
> 
> Now those users who don't care about the exact counts can use batch_pending
> and batch_active like they would be flags (see i-g-t testcase),
> assuming they destroy the context after hang is dealt with.
> For those users who want more detailed heuristics before deciding some action,
> they can use batch_pending and batch_active as counts.
> 
> Non-process-context related statistics are only available
> if CAP_SYS_ADMIN is held.
> 
> Whole series can be found in:
> https://github.com/mkuoppal/linux/commits/arb-robustness
> 
> Test case for i-g-t is in:
> https://github.com/mkuoppal/intel-gpu-tools/commits/arb-robustness
> 
> Thanks for your feedback,
> -Mika

This started as a short email, but grew rather fast... sorry.


I've commented on some stuff up through 5, and I think maybe that was
lost a bit in the noise. So let me restate it here, and once we get
moving on those first 5, I'll pick the review back up.

Patch #1 looks functionally correct, but I don't really think we should
merge it (see comments in ml). It is however:
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

Patch #2 looks functionally correct, though here too, I don't think it's
worth merging.
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

I see no reason for patch #2, and only reasons against patch #1. I've
given my $.02, and I'll leave it up to our maintainer to decide whether
he wants to merge.

Getting patch 3 upstream sooner is really important for enabling PPGTT,
and I think patch 3 is good to go. Patch #3 +
http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=ppgtt-ctx&id=35da04612650fb78b21811a13f88ad23ea1f64be
is regression testable via the existing context tests, and gives me what
I need for PPGTT (ie. they're already in my series).

Patch #3 is already SOB me, and is good to go.


Patches 4 and 5 change the way in which we detect hangs, which is scary
because we haven't really touched that code in quite a while. I'd like
to see those as just patches by themselves. The idea is solid, I feel,
but I'd like to sit on it a bit to make sure all GPU clients can deal
with it. As a side note on patches 4 and 5, I'll point out that I think
you really want the GPU watchdog, which we can't use because the blitter
doesn't have one; however, at some point I believe it is added (maybe
the product that added it has even been released by now), so you should
just keep that in mind. (Shameless plug:
http://lists.freedesktop.org/archives/intel-gfx/2012-July/019012.html)

Assuming nobody screams about their GPU client which had a legitimate 3
seconds of work queued only to be reset, patches 4 and 5 are:
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>

One thing you could do on 5 to make me a little less skittish is leave
the code in for ACTHD and instdone and simply use it for bug triage.
For instance, ACTHD/instdone moving is likely a very different type of
hang than ACTHD/instdone not moving.


In short, here's what I'd recommend:
===================================
Submit 4 and 5 as a separate mini-series explaining the benefits of
tracking hangs by seqno only, and what it intends to accomplish. Take my
suggestion about keeping ACTHD or not. Throw that onto dinq soon and
make sure nobody yells. Assign all hangs for the

Merge 1-3 (or drop 1 & 2, and rework 3 if someone agrees about 1 & 2).
Add the link above with the patch from Chris for good measure to catch
regressions.

Also while we're at it,
>   drm/i915: detect hang using per ring hangcheck_score
This patch introduces a warning:
'i915_hangcheck_hung' defined but not used


[snip]
-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-04-24 23:27 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-04 15:32 [PATCH v3 00/16] arb robustness enablers v3 Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 01/16] drm/i915: return context from i915_switch_context() Mika Kuoppala
2013-04-20 18:10   ` Ben Widawsky
2013-04-04 15:32 ` [PATCH v3 02/16] drm/i915: cleanup i915_add_request Mika Kuoppala
2013-04-20 18:36   ` Ben Widawsky
2013-04-04 15:32 ` [PATCH v3 03/16] drm/i915: reference count for i915_hw_contexts Mika Kuoppala
2013-04-09 22:18   ` Chris Wilson
2013-04-10  0:12     ` [PATCH] " Ben Widawsky
2013-04-20 18:11       ` Ben Widawsky
2013-04-04 15:32 ` [PATCH v3 04/16] drm/i915: pass seqno to i915_hangcheck_ring_idle Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 05/16] drm/i915: track ring progression using seqnos Mika Kuoppala
2013-04-20 18:43   ` Ben Widawsky
2013-04-21 21:07     ` Ben Widawsky
2013-04-22 13:36       ` Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 06/16] drm/i915: introduce i915_hangcheck_ring_hung Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 07/16] drm/i915: detect hang using per ring hangcheck_score Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 08/16] drm/i915: remove i915_hangcheck_hung Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 09/16] drm/i915: add struct i915_ctx_hang_stats Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 10/16] drm/i915: add i915_gem_context_get_hang_stats() Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 11/16] drm/i915: add batch object and context to i915_add_request() Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 12/16] drm/i915: mark rings which were waiting when hang happened Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 13/16] drm/i915: find guilty batch buffer on ring resets Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 14/16] drm/i915: refuse to submit more batchbuffers from guilty context Mika Kuoppala
2013-04-11 15:13   ` Mika Kuoppala
2013-04-16 11:32   ` [PATCH " Mika Kuoppala
2013-04-16 13:59     ` Chris Wilson
2013-04-17 10:11       ` [PATCH v3 " Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 15/16] drm/i915: add i915_reset_count Mika Kuoppala
2013-04-04 15:32 ` [PATCH v3 16/16] drm/i915: add i915_get_reset_stats_ioctl Mika Kuoppala
2013-04-24 23:27 ` [PATCH v3 00/16] arb robustness enablers v3 Ben Widawsky
