From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Subject: [CI-ping 09/15] drm/i915: Prevent leaking of -EIO from i915_wait_request()
Date: Tue, 12 Apr 2016 21:03:03 +0100
Message-ID: <1460491389-8602-9-git-send-email-chris@chris-wilson.co.uk>
In-Reply-To: <1460491389-8602-1-git-send-email-chris@chris-wilson.co.uk>

Reporting -EIO from i915_wait_request() has proven very troublesome
over the years, with numerous hard-to-reproduce bugs cropping up in the
corner case where a reset occurs and the code wasn't expecting such
an error.

If we reset the GPU, or have detected a hang and wish to reset the
GPU, the request is forcibly completed and the wait broken. Currently,
we report either -EAGAIN or -EIO in order for the caller to retreat
and restart the wait (if appropriate) after dropping and then
reacquiring the struct_mutex (essential to allow the GPU reset to
proceed). However, if we take the view that the request is complete
(no further work will be done on it by the GPU because it is dead and
soon to be reset), then we can proceed with the task at hand and then
drop the struct_mutex, allowing the reset to occur. This transfers the
burden of checking whether it is safe to proceed to the caller, and in
all but one instance it is safe to do so - completely eliminating the
source of all spurious -EIO.
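
For illustration, a rough caller-side sketch of the new contract (not
the literal kernel code: the helper name do_work_with_request() is
hypothetical, while __i915_wait_request() and its arguments are taken
from the diff below):

static int do_work_with_request(struct drm_i915_gem_request *req)
{
	int ret;

	ret = __i915_wait_request(req, true, NULL, NULL);
	if (ret) {
		/* Only "real" errors reach us now, e.g. -ERESTARTSYS from
		 * a signal or -ETIME from a timeout. A GPU reset no longer
		 * surfaces as -EAGAIN or -EIO; the request is reported as
		 * complete and we simply carry on, dropping struct_mutex
		 * later so that the reset itself can proceed.
		 */
		return ret;
	}

	/* ... continue with the task at hand ... */
	return 0;
}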

Of note, we only have two API entry points where we expect that
userspace can observe an EIO. The first is when submitting an execbuf:
if the GPU is terminally wedged, the operation cannot succeed and
-EIO is reported. Second, existing userspace uses the throttle ioctl
to detect an already wedged GPU before starting to use HW acceleration
(or to confirm that the GPU is wedged after an error condition). So if
the GPU is wedged when the user calls throttle, also report -EIO.
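
As a purely illustrative userspace-side sketch of that second entry
point - not part of this patch, assuming libdrm and a hypothetical
gpu_is_wedged() helper:

#include <errno.h>
#include <stdbool.h>
#include <xf86drm.h>   /* drmIoctl() */
#include <i915_drm.h>  /* DRM_IOCTL_I915_GEM_THROTTLE */

/* Probe for a terminally wedged GPU before relying on HW acceleration. */
static bool gpu_is_wedged(int drm_fd)
{
	/* With this series, throttle only reports -EIO when the GPU is
	 * terminally wedged; a reset in progress no longer leaks out as
	 * a spurious error.
	 */
	if (drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_THROTTLE, NULL) == 0)
		return false;

	return errno == EIO;
}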

v2: Split more carefully the change to i915_wait_request() and assorted
ABI from the reset handling.
v3: Add a couple of WARN_ON(EIO) to the interruptible modesetting code
so that we don't start to leak EIO there in future (and break our hang
resistant modesetting).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_drv.h         |  2 --
 drivers/gpu/drm/i915/i915_gem.c         | 44 ++++++++++++++++-----------------
 drivers/gpu/drm/i915/i915_gem_userptr.c |  6 ++---
 drivers/gpu/drm/i915/intel_display.c    | 13 +++++-----
 drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c |  3 +--
 6 files changed, 32 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8821d38c07ed..061ecc43d935 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3083,8 +3083,6 @@ i915_gem_find_active_request(struct intel_engine_cs *engine);
 
 bool i915_gem_retire_requests(struct drm_device *dev);
 void i915_gem_retire_requests_ring(struct intel_engine_cs *engine);
-int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
-				      bool interruptible);
 
 static inline u32 i915_reset_counter(struct i915_gpu_error *error)
 {
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 01cdebfea27a..765efa88db32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -206,11 +206,10 @@ i915_gem_object_put_pages_phys(struct drm_i915_gem_object *obj)
 	BUG_ON(obj->madv == __I915_MADV_PURGED);
 
 	ret = i915_gem_object_set_to_cpu_domain(obj, true);
-	if (ret) {
+	if (WARN_ON(ret)) {
 		/* In the event of a disaster, abandon all caches and
 		 * hope for the best.
 		 */
-		WARN_ON(ret != -EIO);
 		obj->base.read_domains = obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	}
 
@@ -1105,15 +1104,13 @@ put_rpm:
 	return ret;
 }
 
-int
-i915_gem_check_wedge(struct i915_gpu_error *error,
-		     bool interruptible)
+static int
+i915_gem_check_wedge(unsigned reset_counter, bool interruptible)
 {
-	if (i915_reset_in_progress_or_wedged(error)) {
-		/* Recovery complete, but the reset failed ... */
-		if (i915_terminally_wedged(error))
-			return -EIO;
+	if (__i915_terminally_wedged(reset_counter))
+		return -EIO;
 
+	if (__i915_reset_in_progress(reset_counter)) {
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these. */
 		if (!interruptible)
@@ -1287,13 +1284,14 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 		prepare_to_wait(&engine->irq_queue, &wait, state);
 
 		/* We need to check whether any gpu reset happened in between
-		 * the caller grabbing the seqno and now ... */
+		 * the request being submitted and now. If a reset has occurred,
+		 * the request is effectively complete (we either are in the
+		 * process of or have discarded the rendering and completely
+		 * reset the GPU. The results of the request are lost and we
+		 * are free to continue on with the original operation.
+		 */
 		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
-			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
-			 * is truely gone. */
-			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
-			if (ret == 0)
-				ret = -EAGAIN;
+			ret = 0;
 			break;
 		}
 
@@ -2154,11 +2152,10 @@ i915_gem_object_put_pages_gtt(struct drm_i915_gem_object *obj)
 	BUG_ON(obj->madv == __I915_MADV_PURGED);
 
 	ret = i915_gem_object_set_to_cpu_domain(obj, true);
-	if (ret) {
+	if (WARN_ON(ret)) {
 		/* In the event of a disaster, abandon all caches and
 		 * hope for the best.
 		 */
-		WARN_ON(ret != -EIO);
 		i915_gem_clflush_object(obj, true);
 		obj->base.read_domains = obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	}
@@ -2729,8 +2726,11 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
 
 	*req_out = NULL;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error,
-				   dev_priv->mm.interruptible);
+	/* ABI: Before userspace accesses the GPU (e.g. execbuffer), report
+	 * EIO if the GPU is already wedged, or EAGAIN to drop the struct_mutex
+	 * and restart.
+	 */
+	ret = i915_gem_check_wedge(reset_counter, dev_priv->mm.interruptible);
 	if (ret)
 		return ret;
 
@@ -4165,9 +4165,9 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 	if (ret)
 		return ret;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
-	if (ret)
-		return ret;
+	/* ABI: return -EIO if already wedged */
+	if (i915_terminally_wedged(&dev_priv->gpu_error))
+		return -EIO;
 
 	spin_lock(&file_priv->mm.lock);
 	list_for_each_entry(request, &file_priv->mm.request_list, client_list) {
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index e6b5938ce6e2..32d9726e38b1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,8 @@ static void cancel_userptr(struct work_struct *work)
 		was_interruptible = dev_priv->mm.interruptible;
 		dev_priv->mm.interruptible = false;
 
-		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
-			int ret = i915_vma_unbind(vma);
-			WARN_ON(ret && ret != -EIO);
-		}
+		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
+			WARN_ON(i915_vma_unbind(vma));
 		WARN_ON(i915_gem_object_put_pages(obj));
 
 		dev_priv->mm.interruptible = was_interruptible;
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 6500f77fc78e..b1b457864e17 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -13436,11 +13436,9 @@ static int intel_atomic_prepare_commit(struct drm_device *dev,
 
 			ret = __i915_wait_request(intel_plane_state->wait_req,
 						  true, NULL, NULL);
-
-			/* Swallow -EIO errors to allow updates during hw lockup. */
-			if (ret == -EIO)
-				ret = 0;
 			if (ret) {
+				/* Any hang should be swallowed by the wait */
+				WARN_ON(ret == -EIO);
 				mutex_lock(&dev->struct_mutex);
 				drm_atomic_helper_cleanup_planes(dev, state);
 				mutex_unlock(&dev->struct_mutex);
@@ -13792,10 +13790,11 @@ intel_prepare_plane_fb(struct drm_plane *plane,
 		 */
 		if (needs_modeset(crtc_state))
 			ret = i915_gem_object_wait_rendering(old_obj, true);
-
-		/* Swallow -EIO errors to allow updates during hw lockup. */
-		if (ret && ret != -EIO)
+		if (ret) {
+			/* GPU hangs should have been swallowed by the wait */
+			WARN_ON(ret == -EIO);
 			return ret;
+		}
 	}
 
 	/* For framebuffer backed by dmabuf, wait for fence */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 027c47dff55b..b8f6b96472a6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1051,7 +1051,7 @@ void intel_logical_ring_stop(struct intel_engine_cs *engine)
 		return;
 
 	ret = intel_engine_idle(engine);
-	if (ret && !i915_reset_in_progress_or_wedged(&dev_priv->gpu_error))
+	if (ret)
 		DRM_ERROR("failed to quiesce %s whilst cleaning up: %d\n",
 			  engine->name, ret);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 55829b5ff837..8391382431b2 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -3183,8 +3183,7 @@ intel_stop_engine(struct intel_engine_cs *engine)
 		return;
 
 	ret = intel_engine_idle(engine);
-	if (ret &&
-	    !i915_reset_in_progress_or_wedged(&to_i915(engine->dev)->gpu_error))
+	if (ret)
 		DRM_ERROR("failed to quiesce %s whilst cleaning up: %d\n",
 			  engine->name, ret);
 
-- 
2.8.0.rc3
