All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] gpu hangman support
@ 2012-04-25 11:57 Daniel Vetter
  2012-04-25 11:57 ` [PATCH 1/6] drm/i915: add interface to simulate gpu hangs Daniel Vetter
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

Hi all,

In the light of the recent oops in the error state code I've resurrected my gpu
hangman and brushed it up a bit. Together with the hangman check in i-g-t (which
now also works a bit more reliable) we'll have a basic test of our hangcheck,
reset and error state infrastructure.

Comments, flames and excessively-painted bikesheds highly welcome.

Cheers, Daniel

Daniel Vetter (6):
  drm/i915: add interface to simulate gpu hangs
  drm/i915: rework dev->first_error locking
  drm/i915: destroy existing error_state when simulating a gpu hang
  drm/i915: simplify i915_reset a bit
  drm/i915: extract i915_do_reset
  drm/i915: make gpu hangman more resilient

 drivers/gpu/drm/i915/i915_debugfs.c     |   80 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_drv.c         |   75 +++++++++++++++++------------
 drivers/gpu/drm/i915/i915_drv.h         |    6 ++
 drivers/gpu/drm/i915/i915_irq.c         |   12 +++--
 drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ++
 5 files changed, 137 insertions(+), 40 deletions(-)

-- 
1.7.9

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] drm/i915: add interface to simulate gpu hangs
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
  2012-04-26  0:03   ` Eugeni Dodonov
  2012-04-25 11:57 ` [PATCH 2/6] drm/i915: rework dev->first_error locking Daniel Vetter
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

gpu reset is a very important piece of our infrastructure.
Unfortunately we only really it test by actually hanging the gpu,
which often has bad side-effects for the entire system. And the gpu
hang handling code is one of the rather complicated pieces of code we
have, consisting of
- hang detection
- error capture
- actual gpu reset
- reset of all the gem bookkeeping
- reinitialition of the entire gpu

This patch adds a debugfs to selectively stopping rings by ceasing to
update the hw tail pointer, which will result in the gpu no longer
updating it's head pointer and eventually to the hangcheck firing.
This way we can exercise the gpu hang code under controlled conditions
without a dying gpu taking down the entire systems.

Patch motivated by me forgetting to properly reinitialize ppgtt after
a gpu reset.

Usage:

echo $((1 << $ringnum)) > i915_ring_stop # stops one ring

echo 0xffffffff > i915_ring_stop # stops all, future-proof version

then run whatever testload is desired. i915_ring_stop automatically
resets after a gpu hang is detected to avoid hanging the gpu to fast
and declaring it wedged.

v2: Incorporate feedback from Chris Wilson.

v3: Add the missing cleanup.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |   65 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.c         |    2 +
 drivers/gpu/drm/i915/i915_drv.h         |    2 +
 drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ++
 4 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 54a1066..51e6b3f 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1601,6 +1601,64 @@ static const struct file_operations i915_wedged_fops = {
 };
 
 static ssize_t
+i915_ring_stop_read(struct file *filp,
+		    char __user *ubuf,
+		    size_t max,
+		    loff_t *ppos)
+{
+	struct drm_device *dev = filp->private_data;
+	drm_i915_private_t *dev_priv = dev->dev_private;
+	char buf[80];
+	int len;
+
+	len = snprintf(buf, sizeof(buf),
+		       "0x%08x\n", dev_priv->stop_rings);
+
+	if (len > sizeof(buf))
+		len = sizeof(buf);
+
+	return simple_read_from_buffer(ubuf, max, ppos, buf, len);
+}
+
+static ssize_t
+i915_ring_stop_write(struct file *filp,
+		     const char __user *ubuf,
+		     size_t cnt,
+		     loff_t *ppos)
+{
+	struct drm_device *dev = filp->private_data;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	char buf[20];
+	int val = 0;
+
+	if (cnt > 0) {
+		if (cnt > sizeof(buf) - 1)
+			return -EINVAL;
+
+		if (copy_from_user(buf, ubuf, cnt))
+			return -EFAULT;
+		buf[cnt] = 0;
+
+		val = simple_strtoul(buf, NULL, 0);
+	}
+
+	DRM_DEBUG_DRIVER("Stopping rings 0x%08x\n", val);
+
+	mutex_lock(&dev->struct_mutex);
+	dev_priv->stop_rings = val;
+	mutex_unlock(&dev->struct_mutex);
+
+	return cnt;
+}
+
+static const struct file_operations i915_ring_stop_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = i915_ring_stop_read,
+	.write = i915_ring_stop_write,
+	.llseek = default_llseek,
+};
+static ssize_t
 i915_max_freq_read(struct file *filp,
 		   char __user *ubuf,
 		   size_t max,
@@ -1901,6 +1959,11 @@ int i915_debugfs_init(struct drm_minor *minor)
 				  &i915_cache_sharing_fops);
 	if (ret)
 		return ret;
+	ret = i915_debugfs_create(minor->debugfs_root, minor,
+				  "i915_ring_stop",
+				  &i915_ring_stop_fops);
+	if (ret)
+		return ret;
 
 	return drm_debugfs_create_files(i915_debugfs_list,
 					I915_DEBUGFS_ENTRIES,
@@ -1919,6 +1982,8 @@ void i915_debugfs_cleanup(struct drm_minor *minor)
 				 1, minor);
 	drm_debugfs_remove_files((struct drm_info_list *) &i915_cache_sharing_fops,
 				 1, minor);
+	drm_debugfs_remove_files((struct drm_info_list *) &i915_ring_stop_fops,
+				 1, minor);
 }
 
 #endif /* CONFIG_DEBUG_FS */
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 8a98f9a..90a84f9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -800,6 +800,8 @@ int i915_reset(struct drm_device *dev, u8 flags)
 	if (!mutex_trylock(&dev->struct_mutex))
 		return -EBUSY;
 
+	dev_priv->stop_rings = 0;
+
 	i915_gem_reset(dev);
 
 	ret = -ENODEV;
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c144013..ee670f8 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -414,6 +414,8 @@ typedef struct drm_i915_private {
 	uint32_t last_instdone;
 	uint32_t last_instdone1;
 
+	unsigned int stop_rings;
+
 	unsigned long cfb_size;
 	unsigned int cfb_fb;
 	enum plane cfb_plane;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f797613..827e340 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1217,7 +1217,11 @@ int intel_ring_begin(struct intel_ring_buffer *ring,
 
 void intel_ring_advance(struct intel_ring_buffer *ring)
 {
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+
 	ring->tail &= ring->size - 1;
+	if (dev_priv->stop_rings & intel_ring_flag(ring))
+		return;
 	ring->write_tail(ring, ring->tail);
 }
 
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/6] drm/i915: rework dev->first_error locking
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
  2012-04-25 11:57 ` [PATCH 1/6] drm/i915: add interface to simulate gpu hangs Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
  2012-04-25 11:57 ` [PATCH 3/6] drm/i915: destroy existing error_state when simulating a gpu hang Daniel Vetter
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

- reduce the irq disabled section, even for a debugfs file this was
  way too long.
- always disable irqs when taking the lock.

v2: Thou shalt not mistake locking for reference counting, so:
- reference count the error_state to protect from concurent freeeing.
  This will be only really used in the next patch.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_debugfs.c |   13 ++++++++-----
 drivers/gpu/drm/i915/i915_drv.h     |    4 ++++
 drivers/gpu/drm/i915/i915_irq.c     |   12 +++++++-----
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 51e6b3f..e2f96d2 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -717,12 +717,16 @@ static int i915_error_state(struct seq_file *m, void *unused)
 	int i, j, page, offset, elt;
 
 	spin_lock_irqsave(&dev_priv->error_lock, flags);
-	if (!dev_priv->first_error) {
+	error = dev_priv->first_error;
+	if (error)
+		kref_get(&error->ref);
+	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
+
+	if (!error) {
 		seq_printf(m, "no error state collected\n");
-		goto out;
+		return 0;
 	}
 
-	error = dev_priv->first_error;
 
 	seq_printf(m, "Time: %ld s %ld us\n", error->time.tv_sec,
 		   error->time.tv_usec);
@@ -804,8 +808,7 @@ static int i915_error_state(struct seq_file *m, void *unused)
 	if (error->display)
 		intel_display_print_error_state(m, dev, error->display);
 
-out:
-	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
+	kref_put(&error->ref, i915_error_state_free);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ee670f8..c59f92b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -39,6 +39,7 @@
 #include <drm/intel-gtt.h>
 #include <linux/backlight.h>
 #include <linux/intel-iommu.h>
+#include <linux/kref.h>
 
 /* General customization:
  */
@@ -171,6 +172,7 @@ struct sdvo_device_mapping {
 struct intel_display_error_state;
 
 struct drm_i915_error_state {
+	struct kref ref;
 	u32 eir;
 	u32 pgtbl_er;
 	u32 pipestat[I915_MAX_PIPES];
@@ -467,6 +469,7 @@ typedef struct drm_i915_private {
 	unsigned int fsb_freq, mem_freq, is_ddr3;
 
 	spinlock_t error_lock;
+	/* Protected by dev->error_lock. */
 	struct drm_i915_error_state *first_error;
 	struct work_struct error_work;
 	struct completion error_completion;
@@ -1179,6 +1182,7 @@ extern int i915_vblank_pipe_get(struct drm_device *dev, void *data,
 				struct drm_file *file_priv);
 extern int i915_vblank_swap(struct drm_device *dev, void *data,
 			    struct drm_file *file_priv);
+void i915_error_state_free(struct kref *error_ref);
 
 void
 i915_enable_pipestat(drm_i915_private_t *dev_priv, int pipe, u32 mask);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 26172ee..cf02898 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -902,10 +902,11 @@ i915_error_object_free(struct drm_i915_error_object *obj)
 	kfree(obj);
 }
 
-static void
-i915_error_state_free(struct drm_device *dev,
-		      struct drm_i915_error_state *error)
+void
+i915_error_state_free(struct kref *error_ref)
 {
+	struct drm_i915_error_state *error = container_of(error_ref,
+							  typeof(*error), ref);
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(error->ring); i++) {
@@ -1134,6 +1135,7 @@ static void i915_capture_error_state(struct drm_device *dev)
 	DRM_INFO("capturing error event; look for more information in /debug/dri/%d/i915_error_state\n",
 		 dev->primary->index);
 
+	kref_init(&error->ref);
 	error->eir = I915_READ(EIR);
 	error->pgtbl_er = I915_READ(PGTBL_ER);
 	for_each_pipe(pipe)
@@ -1194,7 +1196,7 @@ static void i915_capture_error_state(struct drm_device *dev)
 	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
 
 	if (error)
-		i915_error_state_free(dev, error);
+		i915_error_state_free(&error->ref);
 }
 
 void i915_destroy_error_state(struct drm_device *dev)
@@ -1209,7 +1211,7 @@ void i915_destroy_error_state(struct drm_device *dev)
 	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
 
 	if (error)
-		i915_error_state_free(dev, error);
+		kref_put(&error->ref, i915_error_state_free);
 }
 #else
 #define i915_capture_error_state(x)
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/6] drm/i915: destroy existing error_state when simulating a gpu hang
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
  2012-04-25 11:57 ` [PATCH 1/6] drm/i915: add interface to simulate gpu hangs Daniel Vetter
  2012-04-25 11:57 ` [PATCH 2/6] drm/i915: rework dev->first_error locking Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
  2012-04-25 20:20   ` [PATCH] drm/i915: allow the existing error_state to be destroyed Daniel Vetter
  2012-04-25 11:57 ` [PATCH 4/6] drm/i915: simplify i915_reset a bit Daniel Vetter
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

This way we can simulate a bunch of gpu hangs and run the error_state
capture code every time (without the need to reload the module).

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_debugfs.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index e2f96d2..819b4a9 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1648,6 +1648,8 @@ i915_ring_stop_write(struct file *filp,
 	DRM_DEBUG_DRIVER("Stopping rings 0x%08x\n", val);
 
 	mutex_lock(&dev->struct_mutex);
+	i915_destroy_error_state(dev);
+
 	dev_priv->stop_rings = val;
 	mutex_unlock(&dev->struct_mutex);
 
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
                   ` (2 preceding siblings ...)
  2012-04-25 11:57 ` [PATCH 3/6] drm/i915: destroy existing error_state when simulating a gpu hang Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
  2012-04-25 16:18   ` Jesse Barnes
  2012-04-25 11:57 ` [PATCH 5/6] drm/i915: extract i915_do_reset Daniel Vetter
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

- need_display is always true, scrap it.
- don't reacquire the mutex to do nothing after having restored the
  gem state.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |   19 +++++--------------
 1 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 90a84f9..3ffa9e7 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -787,11 +787,6 @@ static int gen6_do_reset(struct drm_device *dev, u8 flags)
 int i915_reset(struct drm_device *dev, u8 flags)
 {
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	/*
-	 * We really should only reset the display subsystem if we actually
-	 * need to
-	 */
-	bool need_display = true;
 	int ret;
 
 	if (!i915_try_reset)
@@ -865,22 +860,18 @@ int i915_reset(struct drm_device *dev, u8 flags)
 		drm_irq_uninstall(dev);
 		drm_mode_config_reset(dev);
 		drm_irq_install(dev);
-
-		mutex_lock(&dev->struct_mutex);
+	} else {
+		mutex_unlock(&dev->struct_mutex);
 	}
 
-	mutex_unlock(&dev->struct_mutex);
-
 	/*
 	 * Perform a full modeset as on later generations, e.g. Ironlake, we may
 	 * need to retrain the display link and cannot just restore the register
 	 * values.
 	 */
-	if (need_display) {
-		mutex_lock(&dev->mode_config.mutex);
-		drm_helper_resume_force_mode(dev);
-		mutex_unlock(&dev->mode_config.mutex);
-	}
+	mutex_lock(&dev->mode_config.mutex);
+	drm_helper_resume_force_mode(dev);
+	mutex_unlock(&dev->mode_config.mutex);
 
 	return 0;
 }
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/6] drm/i915: extract i915_do_reset
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
                   ` (3 preceding siblings ...)
  2012-04-25 11:57 ` [PATCH 4/6] drm/i915: simplify i915_reset a bit Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
  2012-04-25 13:14   ` [PATCH] drm/i915: extract intel_gpu_reset Daniel Vetter
  2012-04-25 11:57 ` [PATCH 6/6] drm/i915: make gpu hangman more resilient Daniel Vetter
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter, Ben Widawsky

Slightly cleans up the code and could be useful for e.g. Ben
Widawsky's hw context patches.

Cc: Ben Widawsky <ben@bwidawsk.net>
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |   43 ++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 3ffa9e7..fac9717 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -768,6 +768,29 @@ static int gen6_do_reset(struct drm_device *dev, u8 flags)
 	return ret;
 }
 
+static int intel_do_hw_reset(struct drm_device *dev, u8 flags)
+{
+	int ret = -ENODEV;
+
+	switch (INTEL_INFO(dev)->gen) {
+	case 7:
+	case 6:
+		ret = gen6_do_reset(dev, flags);
+		break;
+	case 5:
+		ret = ironlake_do_reset(dev, flags);
+		break;
+	case 4:
+		ret = i965_do_reset(dev, flags);
+		break;
+	case 2:
+		ret = i8xx_do_reset(dev, flags);
+		break;
+	}
+
+	return ret;
+}
+
 /**
  * i915_reset - reset chip after a hang
  * @dev: drm device to reset
@@ -800,23 +823,11 @@ int i915_reset(struct drm_device *dev, u8 flags)
 	i915_gem_reset(dev);
 
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->last_gpu_reset < 5) {
+	if (get_seconds() - dev_priv->last_gpu_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
-	} else switch (INTEL_INFO(dev)->gen) {
-	case 7:
-	case 6:
-		ret = gen6_do_reset(dev, flags);
-		break;
-	case 5:
-		ret = ironlake_do_reset(dev, flags);
-		break;
-	case 4:
-		ret = i965_do_reset(dev, flags);
-		break;
-	case 2:
-		ret = i8xx_do_reset(dev, flags);
-		break;
-	}
+	else
+		ret = intel_do_hw_reset(dev, flags);
+
 	dev_priv->last_gpu_reset = get_seconds();
 	if (ret) {
 		DRM_ERROR("Failed to reset chip.\n");
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/6] drm/i915: make gpu hangman more resilient
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
                   ` (4 preceding siblings ...)
  2012-04-25 11:57 ` [PATCH 5/6] drm/i915: extract i915_do_reset Daniel Vetter
@ 2012-04-25 11:57 ` Daniel Vetter
       [not found]   ` <CAC7LmnuSWGfup9Vd9dnH_4BanFguRUj_y8Q1L+Jx-RoaUK2KsA@mail.gmail.com>
  2012-04-25 12:42 ` [PATCH 0/6] gpu hangman support Chris Wilson
  2012-04-27  0:03 ` Ben Widawsky
  7 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 11:57 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

- reset the stop_rings infrastructure while resetting the hw to
  avoid angering the hangcheck right away (and potentially declaring
  the gpu permanently wedged).

- ignore reset failures when hanging due to the hangman - we don't
  have reset code for all generations.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index fac9717..d4030c5 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -770,6 +770,7 @@ static int gen6_do_reset(struct drm_device *dev, u8 flags)
 
 static int intel_do_hw_reset(struct drm_device *dev, u8 flags)
 {
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	int ret = -ENODEV;
 
 	switch (INTEL_INFO(dev)->gen) {
@@ -788,6 +789,16 @@ static int intel_do_hw_reset(struct drm_device *dev, u8 flags)
 		break;
 	}
 
+	/* Also reset the gpu hangman. */
+	if (dev_priv->stop_rings) {
+		DRM_DEBUG("Simulated gpu hang, resetting stop_rings\n");
+		dev_priv->stop_rings = 0;
+		if (ret) {
+			DRM_ERROR("Ignoring reset failure in simulated gpu hang\n");
+			ret = 0;
+		}
+	}
+
 	return ret;
 }
 
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] gpu hangman support
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
                   ` (5 preceding siblings ...)
  2012-04-25 11:57 ` [PATCH 6/6] drm/i915: make gpu hangman more resilient Daniel Vetter
@ 2012-04-25 12:42 ` Chris Wilson
  2012-04-27  0:03 ` Ben Widawsky
  7 siblings, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2012-04-25 12:42 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

On Wed, 25 Apr 2012 13:57:07 +0200, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Hi all,
> 
> In the light of the recent oops in the error state code I've resurrected my gpu
> hangman and brushed it up a bit. Together with the hangman check in i-g-t (which
> now also works a bit more reliable) we'll have a basic test of our hangcheck,
> reset and error state infrastructure.

My only real complaint, aside from plenty of good advice on colour
selection, is the freeing of error state from within
/debug/i915_gem_stop_rings. I think that is a limitation borne out of
convenience, and would rather have a writable /debug/i915_error_state.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] drm/i915: extract intel_gpu_reset
  2012-04-25 11:57 ` [PATCH 5/6] drm/i915: extract i915_do_reset Daniel Vetter
@ 2012-04-25 13:14   ` Daniel Vetter
  2012-04-25 16:43     ` Ben Widawsky
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 13:14 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter, Ben Widawsky

Slightly cleans up the code and could be useful for e.g. Ben
Widawsky's hw context patches.

v2: New colours!

Cc: Ben Widawsky <ben@bwidawsk.net>
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.c |   43 ++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 3ffa9e7..2763084 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -768,6 +768,29 @@ static int gen6_do_reset(struct drm_device *dev, u8 flags)
 	return ret;
 }
 
+static int intel_gpu_reset(struct drm_device *dev, u8 flags)
+{
+	int ret = -ENODEV;
+
+	switch (INTEL_INFO(dev)->gen) {
+	case 7:
+	case 6:
+		ret = gen6_do_reset(dev, flags);
+		break;
+	case 5:
+		ret = ironlake_do_reset(dev, flags);
+		break;
+	case 4:
+		ret = i965_do_reset(dev, flags);
+		break;
+	case 2:
+		ret = i8xx_do_reset(dev, flags);
+		break;
+	}
+
+	return ret;
+}
+
 /**
  * i915_reset - reset chip after a hang
  * @dev: drm device to reset
@@ -800,23 +823,11 @@ int i915_reset(struct drm_device *dev, u8 flags)
 	i915_gem_reset(dev);
 
 	ret = -ENODEV;
-	if (get_seconds() - dev_priv->last_gpu_reset < 5) {
+	if (get_seconds() - dev_priv->last_gpu_reset < 5)
 		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
-	} else switch (INTEL_INFO(dev)->gen) {
-	case 7:
-	case 6:
-		ret = gen6_do_reset(dev, flags);
-		break;
-	case 5:
-		ret = ironlake_do_reset(dev, flags);
-		break;
-	case 4:
-		ret = i965_do_reset(dev, flags);
-		break;
-	case 2:
-		ret = i8xx_do_reset(dev, flags);
-		break;
-	}
+	else
+		ret = intel_gpu_reset(dev, flags);
+
 	dev_priv->last_gpu_reset = get_seconds();
 	if (ret) {
 		DRM_ERROR("Failed to reset chip.\n");
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 11:57 ` [PATCH 4/6] drm/i915: simplify i915_reset a bit Daniel Vetter
@ 2012-04-25 16:18   ` Jesse Barnes
  2012-04-25 20:27     ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Jesse Barnes @ 2012-04-25 16:18 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel Graphics Development

On Wed, 25 Apr 2012 13:57:11 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> - need_display is always true, scrap it.
> - don't reacquire the mutex to do nothing after having restored the
>   gem state.
> 

Actually I think we generally *don't* need to reset display.  It's
currently there just because we haven't tried reset handling without
it.  But not resetting display would make resets a little less visible,
which would be nice.  Now that you have a nice test setup, can you try
testing without the display engine bit set (i.e. only reset render and
media at hang time)?

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] drm/i915: extract intel_gpu_reset
  2012-04-25 13:14   ` [PATCH] drm/i915: extract intel_gpu_reset Daniel Vetter
@ 2012-04-25 16:43     ` Ben Widawsky
  0 siblings, 0 replies; 19+ messages in thread
From: Ben Widawsky @ 2012-04-25 16:43 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel Graphics Development

On Wed, 25 Apr 2012 15:14:05 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Slightly cleans up the code and could be useful for e.g. Ben
> Widawsky's hw context patches.
> 
> v2: New colours!
> 
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I have this queued for both HW context, and DPF, but you've beaten me to
it.

Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
or (your choice)
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

> ---
>  drivers/gpu/drm/i915/i915_drv.c |   43 ++++++++++++++++++++++++--------------
>  1 files changed, 27 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 3ffa9e7..2763084 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -768,6 +768,29 @@ static int gen6_do_reset(struct drm_device *dev, u8 flags)
>  	return ret;
>  }
>  
> +static int intel_gpu_reset(struct drm_device *dev, u8 flags)
> +{
> +	int ret = -ENODEV;
> +
> +	switch (INTEL_INFO(dev)->gen) {
> +	case 7:
> +	case 6:
> +		ret = gen6_do_reset(dev, flags);
> +		break;
> +	case 5:
> +		ret = ironlake_do_reset(dev, flags);
> +		break;
> +	case 4:
> +		ret = i965_do_reset(dev, flags);
> +		break;
> +	case 2:
> +		ret = i8xx_do_reset(dev, flags);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  /**
>   * i915_reset - reset chip after a hang
>   * @dev: drm device to reset
> @@ -800,23 +823,11 @@ int i915_reset(struct drm_device *dev, u8 flags)
>  	i915_gem_reset(dev);
>  
>  	ret = -ENODEV;
> -	if (get_seconds() - dev_priv->last_gpu_reset < 5) {
> +	if (get_seconds() - dev_priv->last_gpu_reset < 5)
>  		DRM_ERROR("GPU hanging too fast, declaring wedged!\n");
> -	} else switch (INTEL_INFO(dev)->gen) {
> -	case 7:
> -	case 6:
> -		ret = gen6_do_reset(dev, flags);
> -		break;
> -	case 5:
> -		ret = ironlake_do_reset(dev, flags);
> -		break;
> -	case 4:
> -		ret = i965_do_reset(dev, flags);
> -		break;
> -	case 2:
> -		ret = i8xx_do_reset(dev, flags);
> -		break;
> -	}
> +	else
> +		ret = intel_gpu_reset(dev, flags);
> +
>  	dev_priv->last_gpu_reset = get_seconds();
>  	if (ret) {
>  		DRM_ERROR("Failed to reset chip.\n");

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] drm/i915: allow the existing error_state to be destroyed
  2012-04-25 11:57 ` [PATCH 3/6] drm/i915: destroy existing error_state when simulating a gpu hang Daniel Vetter
@ 2012-04-25 20:20   ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 20:20 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter

... by writing (anything) to i915_error_state.

This way we can simulate a bunch of gpu hangs and run the error_state
capture code every time (without the need to reload the module).

To make that happen we need to abandon the simple seq_file wrappers
provided by the drm core. While at it, but the new error_state
refcounting to some good use and associated. This should help greatly
when we finally get around to split up the giant single seq_file block
that the error_state file currently is into smaller parts.

Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_debugfs.c |   85 ++++++++++++++++++++++++++++++-----
 1 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index e2f96d2..a932e28 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -707,21 +707,19 @@ static void i915_ring_error_state(struct seq_file *m,
 	seq_printf(m, "  ring->tail: 0x%08x\n", error->cpu_ring_tail[ring]);
 }
 
+struct i915_error_state_file_priv {
+	struct drm_device *dev;
+	struct drm_i915_error_state *error;
+};
+
 static int i915_error_state(struct seq_file *m, void *unused)
 {
-	struct drm_info_node *node = (struct drm_info_node *) m->private;
-	struct drm_device *dev = node->minor->dev;
+	struct i915_error_state_file_priv *error_priv = m->private;
+	struct drm_device *dev = error_priv->dev;
 	drm_i915_private_t *dev_priv = dev->dev_private;
-	struct drm_i915_error_state *error;
-	unsigned long flags;
+	struct drm_i915_error_state *error = error_priv->error;
 	int i, j, page, offset, elt;
 
-	spin_lock_irqsave(&dev_priv->error_lock, flags);
-	error = dev_priv->first_error;
-	if (error)
-		kref_get(&error->ref);
-	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
-
 	if (!error) {
 		seq_printf(m, "no error state collected\n");
 		return 0;
@@ -813,6 +811,65 @@ static int i915_error_state(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static ssize_t
+i915_error_state_write(struct file *filp,
+		       const char __user *ubuf,
+		       size_t cnt,
+		       loff_t *ppos)
+{
+	struct drm_device *dev = filp->private_data;
+
+	DRM_DEBUG_DRIVER("Resetting error state\n");
+
+	mutex_lock(&dev->struct_mutex);
+	i915_destroy_error_state(dev);
+	mutex_unlock(&dev->struct_mutex);
+
+	return cnt;
+}
+
+static int i915_error_state_open(struct inode *inode, struct file *file)
+{
+	struct drm_device *dev = inode->i_private;
+	drm_i915_private_t *dev_priv = dev->dev_private;
+	struct i915_error_state_file_priv *error_priv;
+	unsigned long flags;
+
+	error_priv = kzalloc(sizeof(*error_priv), GFP_KERNEL);
+	if (!error_priv)
+		return -ENOMEM;
+
+	error_priv->dev = dev;
+
+	spin_lock_irqsave(&dev_priv->error_lock, flags);
+	error_priv->error = dev_priv->first_error;
+	if (error_priv->error)
+		kref_get(&error_priv->error->ref);
+	spin_unlock_irqrestore(&dev_priv->error_lock, flags);
+
+	return single_open(file, i915_error_state, error_priv);
+}
+
+static int i915_error_state_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct i915_error_state_file_priv *error_priv = m->private;
+
+	kref_put(&error_priv->error->ref, i915_error_state_free);
+	kfree(error_priv);
+
+	return single_release(inode, file);
+}
+
+static const struct file_operations i915_error_state_fops = {
+	.owner = THIS_MODULE,
+	.open = i915_error_state_open,
+	.read = seq_read,
+	.write = i915_error_state_write,
+	.llseek = default_llseek,
+	.release = i915_error_state_release,
+};
+
 static int i915_rstdby_delays(struct seq_file *m, void *unused)
 {
 	struct drm_info_node *node = (struct drm_info_node *) m->private;
@@ -1661,6 +1718,7 @@ static const struct file_operations i915_ring_stop_fops = {
 	.write = i915_ring_stop_write,
 	.llseek = default_llseek,
 };
+
 static ssize_t
 i915_max_freq_read(struct file *filp,
 		   char __user *ubuf,
@@ -1916,7 +1974,6 @@ static struct drm_info_list i915_debugfs_list[] = {
 	{"i915_gem_hws", i915_hws_info, 0, (void *)RCS},
 	{"i915_gem_hws_blt", i915_hws_info, 0, (void *)BCS},
 	{"i915_gem_hws_bsd", i915_hws_info, 0, (void *)VCS},
-	{"i915_error_state", i915_error_state, 0},
 	{"i915_rstdby_delays", i915_rstdby_delays, 0},
 	{"i915_cur_delayinfo", i915_cur_delayinfo, 0},
 	{"i915_delayfreq_table", i915_delayfreq_table, 0},
@@ -1968,6 +2025,12 @@ int i915_debugfs_init(struct drm_minor *minor)
 	if (ret)
 		return ret;
 
+	ret = i915_debugfs_create(minor->debugfs_root, minor,
+				  "i915_error_state",
+				  &i915_error_state_fops);
+	if (ret)
+		return ret;
+
 	return drm_debugfs_create_files(i915_debugfs_list,
 					I915_DEBUGFS_ENTRIES,
 					minor->debugfs_root, minor);
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 16:18   ` Jesse Barnes
@ 2012-04-25 20:27     ` Daniel Vetter
  2012-04-25 20:34       ` Jesse Barnes
  2012-04-25 23:14       ` Eric Anholt
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 20:27 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Apr 25, 2012 at 09:18:21AM -0700, Jesse Barnes wrote:
> On Wed, 25 Apr 2012 13:57:11 +0200
> Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > - need_display is always true, scrap it.
> > - don't reacquire the mutex to do nothing after having restored the
> >   gem state.
> > 
> 
> Actually I think we generally *don't* need to reset display.  It's
> currently there just because we haven't tried reset handling without
> it.  But not resetting display would make resets a little less visible,
> which would be nice.  Now that you have a nice test setup, can you try
> testing without the display engine bit set (i.e. only reset render and
> media at hang time)?

Imo 'less visible' and gpu hang don't go well together, the 3 second
freeze plus screen flicker at least ensure that users report bugs. And I
have no idea whether resetting the entire gpu or just the render portion
has a greater chance of survival. So I'm not sure whether working on this
has much benefit ... at least opposed to working on the bugs ;-)
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 20:27     ` Daniel Vetter
@ 2012-04-25 20:34       ` Jesse Barnes
  2012-04-25 21:04         ` Daniel Vetter
  2012-04-25 23:14       ` Eric Anholt
  1 sibling, 1 reply; 19+ messages in thread
From: Jesse Barnes @ 2012-04-25 20:34 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, 25 Apr 2012 22:27:12 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Apr 25, 2012 at 09:18:21AM -0700, Jesse Barnes wrote:
> > On Wed, 25 Apr 2012 13:57:11 +0200
> > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > 
> > > - need_display is always true, scrap it.
> > > - don't reacquire the mutex to do nothing after having restored the
> > >   gem state.
> > > 
> > 
> > Actually I think we generally *don't* need to reset display.  It's
> > currently there just because we haven't tried reset handling without
> > it.  But not resetting display would make resets a little less visible,
> > which would be nice.  Now that you have a nice test setup, can you try
> > testing without the display engine bit set (i.e. only reset render and
> > media at hang time)?
> 
> Imo 'less visible' and gpu hang don't go well together, the 3 second
> freeze plus screen flicker at least ensure that users report bugs. And I
> have no idea whether resetting the entire gpu or just the render portion
> has a greater chance of survival. So I'm not sure whether working on this
> has much benefit ... at least opposed to working on the bugs ;-)

With the CPU/PCH split, display reset is actually a really big hammer.
Theoretically, just doing a render/media reset will be more reliable
and have less chance of wedging the system really hard.

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 20:34       ` Jesse Barnes
@ 2012-04-25 21:04         ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-25 21:04 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Apr 25, 2012 at 01:34:45PM -0700, Jesse Barnes wrote:
> On Wed, 25 Apr 2012 22:27:12 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Wed, Apr 25, 2012 at 09:18:21AM -0700, Jesse Barnes wrote:
> > > On Wed, 25 Apr 2012 13:57:11 +0200
> > > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > 
> > > > - need_display is always true, scrap it.
> > > > - don't reacquire the mutex to do nothing after having restored the
> > > >   gem state.
> > > > 
> > > 
> > > Actually I think we generally *don't* need to reset display.  It's
> > > currently there just because we haven't tried reset handling without
> > > it.  But not resetting display would make resets a little less visible,
> > > which would be nice.  Now that you have a nice test setup, can you try
> > > testing without the display engine bit set (i.e. only reset render and
> > > media at hang time)?
> > 
> > Imo 'less visible' and gpu hang don't go well together, the 3 second
> > freeze plus screen flicker at least ensure that users report bugs. And I
> > have no idea whether resetting the entire gpu or just the render portion
> > has a greater chance of survival. So I'm not sure whether working on this
> > has much benefit ... at least opposed to working on the bugs ;-)
> 
> With the CPU/PCH split, display reset is actually a really big hammer.
> Theoretically, just doing a render/media reset will be more reliable
> and have less chance of wedging the system really hard.

Hm, yeah. I think on pch we already only do a reset of the render/media
core and no longer of the display unit. So I think we could just add a
pch_split test around the. But atm we still have some funny things like
the flags parameter which half the reset functions ignore that I'd like to
clean up first. So imo this can wait a bit.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] drm/i915: simplify i915_reset a bit
  2012-04-25 20:27     ` Daniel Vetter
  2012-04-25 20:34       ` Jesse Barnes
@ 2012-04-25 23:14       ` Eric Anholt
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Anholt @ 2012-04-25 23:14 UTC (permalink / raw)
  To: Daniel Vetter, Jesse Barnes; +Cc: Daniel Vetter, Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 1479 bytes --]

On Wed, 25 Apr 2012 22:27:12 +0200, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Apr 25, 2012 at 09:18:21AM -0700, Jesse Barnes wrote:
> > On Wed, 25 Apr 2012 13:57:11 +0200
> > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > 
> > > - need_display is always true, scrap it.
> > > - don't reacquire the mutex to do nothing after having restored the
> > >   gem state.
> > > 
> > 
> > Actually I think we generally *don't* need to reset display.  It's
> > currently there just because we haven't tried reset handling without
> > it.  But not resetting display would make resets a little less visible,
> > which would be nice.  Now that you have a nice test setup, can you try
> > testing without the display engine bit set (i.e. only reset render and
> > media at hang time)?
> 
> Imo 'less visible' and gpu hang don't go well together, the 3 second
> freeze plus screen flicker at least ensure that users report bugs. And I
> have no idea whether resetting the entire gpu or just the render portion
> has a greater chance of survival. So I'm not sure whether working on this
> has much benefit ... at least opposed to working on the bugs ;-)

I've been hanging my system every couple of minutes most of today, and
this DP panel takes about 15 seconds to light up after a modeset.
Please stop resetting display when I hang the GPU.

(Then if we could get rid of the random decision to fallback to sw in
the 2d driver on a gpu hang, I'd be happy)

[-- Attachment #1.2: Type: application/pgp-signature, Size: 197 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] drm/i915: add interface to simulate gpu hangs
  2012-04-25 11:57 ` [PATCH 1/6] drm/i915: add interface to simulate gpu hangs Daniel Vetter
@ 2012-04-26  0:03   ` Eugeni Dodonov
  0 siblings, 0 replies; 19+ messages in thread
From: Eugeni Dodonov @ 2012-04-26  0:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel Graphics Development


[-- Attachment #1.1: Type: text/plain, Size: 2484 bytes --]

On Wed, Apr 25, 2012 at 08:57, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> gpu reset is a very important piece of our infrastructure.
> Unfortunately we only really it test by actually hanging the gpu,
> which often has bad side-effects for the entire system. And the gpu
> hang handling code is one of the rather complicated pieces of code we
> have, consisting of
> - hang detection
> - error capture
> - actual gpu reset
> - reset of all the gem bookkeeping
> - reinitialition of the entire gpu
>
> This patch adds a debugfs to selectively stopping rings by ceasing to
> update the hw tail pointer, which will result in the gpu no longer
> updating it's head pointer and eventually to the hangcheck firing.
> This way we can exercise the gpu hang code under controlled conditions
> without a dying gpu taking down the entire systems.
>
> Patch motivated by me forgetting to properly reinitialize ppgtt after
> a gpu reset.
>
> Usage:
>
> echo $((1 << $ringnum)) > i915_ring_stop # stops one ring
>
> echo 0xffffffff > i915_ring_stop # stops all, future-proof version
>
> then run whatever testload is desired. i915_ring_stop automatically
> resets after a gpu hang is detected to avoid hanging the gpu to fast
> and declaring it wedged.
>
> v2: Incorporate feedback from Chris Wilson.
>
> v3: Add the missing cleanup.
>
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>

I think I sent my R-b for the previous series, but just in case, for the
new one:
Reviewed-By: Eugeni Dodonov <eugeni.dodonov@intel.com>

With small bikeshed below:

 static ssize_t
> +i915_ring_stop_read(struct file *filp,
> +                   char __user *ubuf,
> +                   size_t max,
> +                   loff_t *ppos)
> +{
> +       struct drm_device *dev = filp->private_data;
> +       drm_i915_private_t *dev_priv = dev->dev_private;
> +       char buf[80];
>

buf is 80 characters here, but


> +static ssize_t
> +i915_ring_stop_write(struct file *filp,
> +                    const char __user *ubuf,
> +                    size_t cnt,
> +                    loff_t *ppos)
> +{
> +       struct drm_device *dev = filp->private_data;
> +       struct drm_i915_private *dev_priv = dev->dev_private;
> +       char buf[20];
>

here it is 20.. I don't think we'll need more than 20 the way it is
supposed to work, so maybe standardize it to 80 above as well for
consistency?

-- 
Eugeni Dodonov
<http://eugeni.dodonov.net/>

[-- Attachment #1.2: Type: text/html, Size: 3541 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/6] drm/i915: make gpu hangman more resilient
       [not found]   ` <CAC7LmnuSWGfup9Vd9dnH_4BanFguRUj_y8Q1L+Jx-RoaUK2KsA@mail.gmail.com>
@ 2012-04-26  8:25     ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2012-04-26  8:25 UTC (permalink / raw)
  To: Eugeni Dodonov; +Cc: Daniel Vetter, Intel Graphics Development

On Wed, Apr 25, 2012 at 11:30:24PM -0300, Eugeni Dodonov wrote:
> On Apr 25, 2012 9:01 AM, "Daniel Vetter" <daniel.vetter@ffwll.ch> wrote:
> > +       /* Also reset the gpu hangman. */
> > +       if (dev_priv->stop_rings) {
> > +               DRM_DEBUG("Simulated gpu hang, resetting stop_rings\n");
> 
> Maybe also print the stop_ring value here?
> 
> > +               dev_priv->stop_rings = 0;
> > +               if (ret) {
> > +                       DRM_ERROR("Ignoring reset failure in simulated
> gpu hang\n");
> 
> ...and error value here as well?
> 
> If we are simulating hangs, I'd guess that we are interested in those
> values..

Well, I think I'll check for -ENODEV only instead because I simply don't
want to kill the gpu and force a module reload just because we don't have
the reset code implemented.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] gpu hangman support
  2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
                   ` (6 preceding siblings ...)
  2012-04-25 12:42 ` [PATCH 0/6] gpu hangman support Chris Wilson
@ 2012-04-27  0:03 ` Ben Widawsky
  7 siblings, 0 replies; 19+ messages in thread
From: Ben Widawsky @ 2012-04-27  0:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel Graphics Development

On Wed, 25 Apr 2012 13:57:07 +0200
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Hi all,
> 
> In the light of the recent oops in the error state code I've resurrected my gpu
> hangman and brushed it up a bit. Together with the hangman check in i-g-t (which
> now also works a bit more reliable) we'll have a basic test of our hangcheck,
> reset and error state infrastructure.
> 
> Comments, flames and excessively-painted bikesheds highly welcome.
> 
> Cheers, Daniel

Would you mind separating out the reset patches? I really want those
regardless of the fate of hangman.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-04-27  0:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-25 11:57 [PATCH 0/6] gpu hangman support Daniel Vetter
2012-04-25 11:57 ` [PATCH 1/6] drm/i915: add interface to simulate gpu hangs Daniel Vetter
2012-04-26  0:03   ` Eugeni Dodonov
2012-04-25 11:57 ` [PATCH 2/6] drm/i915: rework dev->first_error locking Daniel Vetter
2012-04-25 11:57 ` [PATCH 3/6] drm/i915: destroy existing error_state when simulating a gpu hang Daniel Vetter
2012-04-25 20:20   ` [PATCH] drm/i915: allow the existing error_state to be destroyed Daniel Vetter
2012-04-25 11:57 ` [PATCH 4/6] drm/i915: simplify i915_reset a bit Daniel Vetter
2012-04-25 16:18   ` Jesse Barnes
2012-04-25 20:27     ` Daniel Vetter
2012-04-25 20:34       ` Jesse Barnes
2012-04-25 21:04         ` Daniel Vetter
2012-04-25 23:14       ` Eric Anholt
2012-04-25 11:57 ` [PATCH 5/6] drm/i915: extract i915_do_reset Daniel Vetter
2012-04-25 13:14   ` [PATCH] drm/i915: extract intel_gpu_reset Daniel Vetter
2012-04-25 16:43     ` Ben Widawsky
2012-04-25 11:57 ` [PATCH 6/6] drm/i915: make gpu hangman more resilient Daniel Vetter
     [not found]   ` <CAC7LmnuSWGfup9Vd9dnH_4BanFguRUj_y8Q1L+Jx-RoaUK2KsA@mail.gmail.com>
2012-04-26  8:25     ` Daniel Vetter
2012-04-25 12:42 ` [PATCH 0/6] gpu hangman support Chris Wilson
2012-04-27  0:03 ` Ben Widawsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.