All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again
@ 2017-06-29 13:49 ville.syrjala
  2017-06-29 13:49 ` [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors() ville.syrjala
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

I set out to fix the pre-g4x GPU reset by protecting display commits with
an rw_semaphore. I originally went all out and added infrastructure to track
the committed state (as opposed the latest swapped state), but Daniel suggested 
that we want to backport the thing so I simplified it to just use obj->state
instead. I will be posting the committed state handling as a followup as it
will also DTRT if/when we will start allowing queueing multiple commits per-crtc.

Not sure if we want to put the "committed" state stuf into the atomic helper or
I should just pull it all into i915. Suggestions welcome.

Series available here:
git://github.com/vsyrjala/linux.git reset_commit_rwsem_norefactor_2,

Ville Syrjälä (5):
  drm/atomic: Refactor drm_atomic_state_realloc_connectors()
  drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
  drm/i915% Store vma gtt offset in plane state
  drm/i915: Refactor __intel_atomic_commit_tail()
  drm/i915: Solve the GPU reset vs. modeset deadlocks with an
    rw_semaphore

 drivers/gpu/drm/drm_atomic.c         |  43 ++++--
 drivers/gpu/drm/drm_atomic_helper.c  | 123 +++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/i915_irq.c      |  44 +-----
 drivers/gpu/drm/i915/intel_display.c | 251 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/intel_drv.h     |   6 +-
 drivers/gpu/drm/i915/intel_sprite.c  |   8 +-
 include/drm/drm_atomic.h             |   5 +
 include/drm/drm_atomic_helper.h      |   4 +
 9 files changed, 338 insertions(+), 148 deletions(-)

-- 
2.13.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors()
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
@ 2017-06-29 13:49 ` ville.syrjala
  2017-06-30 13:15   ` Daniel Vetter
  2017-06-29 13:49 ` [PATCH 2/5] drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state() ville.syrjala
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Pull the code to reallocate the state->connectors[] array into a
helper function. We'll have another use for this later.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/drm_atomic.c | 43 +++++++++++++++++++++++++++++--------------
 include/drm/drm_atomic.h     |  5 +++++
 2 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c
index 095e87278a88..a9f02b214fc6 100644
--- a/drivers/gpu/drm/drm_atomic.c
+++ b/drivers/gpu/drm/drm_atomic.c
@@ -1043,6 +1043,32 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj,
 }
 EXPORT_SYMBOL(drm_atomic_get_private_obj_state);
 
+int drm_atomic_state_realloc_connectors(struct drm_device *dev,
+					struct drm_atomic_state *state,
+					int index)
+{
+	struct drm_mode_config *config = &dev->mode_config;
+	struct __drm_connnectors_state *c;
+	int alloc = max(index + 1, config->num_connector);
+
+	if (index < state->num_connector)
+		return 0;
+
+	c = krealloc(state->connectors,
+		     alloc * sizeof(*state->connectors), GFP_KERNEL);
+	if (!c)
+		return -ENOMEM;
+
+	state->connectors = c;
+	memset(&state->connectors[state->num_connector], 0,
+	       sizeof(*state->connectors) * (alloc - state->num_connector));
+
+	state->num_connector = alloc;
+
+	return 0;
+}
+EXPORT_SYMBOL(drm_atomic_state_realloc_connectors);
+
 /**
  * drm_atomic_get_connector_state - get connector state
  * @state: global atomic state object
@@ -1074,20 +1100,9 @@ drm_atomic_get_connector_state(struct drm_atomic_state *state,
 
 	index = drm_connector_index(connector);
 
-	if (index >= state->num_connector) {
-		struct __drm_connnectors_state *c;
-		int alloc = max(index + 1, config->num_connector);
-
-		c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL);
-		if (!c)
-			return ERR_PTR(-ENOMEM);
-
-		state->connectors = c;
-		memset(&state->connectors[state->num_connector], 0,
-		       sizeof(*state->connectors) * (alloc - state->num_connector));
-
-		state->num_connector = alloc;
-	}
+	ret = drm_atomic_state_realloc_connectors(connector->dev, state, index);
+	if (ret)
+		return ERR_PTR(ret);
 
 	if (state->connectors[index].state)
 		return state->connectors[index].state;
diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h
index 0196f264a418..5596ad58bcdc 100644
--- a/include/drm/drm_atomic.h
+++ b/include/drm/drm_atomic.h
@@ -324,6 +324,11 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state,
 			      void *obj,
 			      const struct drm_private_state_funcs *funcs);
 
+int __must_check
+drm_atomic_state_realloc_connectors(struct drm_device *dev,
+				    struct drm_atomic_state *state,
+				    int index);
+
 /**
  * drm_atomic_get_existing_crtc_state - get crtc state, if it exists
  * @state: global atomic state object
-- 
2.13.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/5] drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
  2017-06-29 13:49 ` [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors() ville.syrjala
@ 2017-06-29 13:49 ` ville.syrjala
  2017-06-29 13:49 ` [PATCH 3/5] drm/i915% Store vma gtt offset in plane state ville.syrjala
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

For i915 GPU reset handling we'll want to be able to duplicate the state
that was last committed to the hardware. For that purpose let's provide
a helper function that is supposed to duplicate the state last committed
to the hardware. For now we'll actually just duplicate the last swapped
state for each object. That isn't quite correct but being able to
duplicate the actaully committed state will require larger refactoring.

Since we will access obj->state without the protection of the
appropriate locks there is a small chance that this might blow up. That
problem too will get solved once we start dealing with the committed
state correctly.

Note that we duplicates the current tate to to both old_state and
new_state. For the purposes of i915 GPU reset we would only need one
of them, but we actually need two top level states; one for
disabling everything (which would need the duplicated state to be
old_state), and another to reenable everything (which would need the
duplicated state to be new_state). So to make it less comples I figured
I'd just always duplicate both. Might want to rethink this if for no
other reason that reducing the chances of memory allocation failure.
Due to the double state duplication we need
drm_atomic_helper_clean_committed_state() to clean up the duplicated
old_state since that's not handled by the normal drm_atomic_state
cleanup code.

TODO: do we want this in the helper, or maybe it should be just in i915?

v2: s/commited/committed/ everywhere (checkpatch)
    Handle state duplication errors better

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 123 ++++++++++++++++++++++++++++++++++++
 include/drm/drm_atomic_helper.h     |   4 ++
 2 files changed, 127 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 2f269e4267da..a6ee0d16f723 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -3596,6 +3596,129 @@ drm_atomic_helper_duplicate_state(struct drm_device *dev,
 }
 EXPORT_SYMBOL(drm_atomic_helper_duplicate_state);
 
+struct drm_atomic_state *
+drm_atomic_helper_duplicate_committed_state(struct drm_device *dev)
+{
+	struct drm_atomic_state *state;
+	struct drm_connector_list_iter conn_iter;
+	struct drm_connector *conn;
+	struct drm_plane *plane;
+	struct drm_crtc *crtc;
+	int err = 0;
+
+	state = drm_atomic_state_alloc(dev);
+	if (!state)
+		return ERR_PTR(-ENOMEM);
+
+	drm_for_each_plane(plane, dev) {
+		int i = drm_plane_index(plane);
+
+		state->planes[i].ptr = plane;
+		state->planes[i].state =
+			plane->funcs->atomic_duplicate_state(plane);
+		state->planes[i].new_state = state->planes[i].state;
+		state->planes[i].old_state =
+			plane->funcs->atomic_duplicate_state(plane);
+
+		if (!state->planes[i].state ||
+		    !state->planes[i].old_state) {
+			err = -ENOMEM;
+			goto free;
+		}
+
+		state->planes[i].old_state->state = state;
+	}
+
+	drm_for_each_crtc(crtc, dev) {
+		int i = drm_crtc_index(crtc);
+
+		state->crtcs[i].ptr = crtc;
+		state->crtcs[i].state =
+			crtc->funcs->atomic_duplicate_state(crtc);
+		state->crtcs[i].new_state = state->crtcs[i].state;
+		state->crtcs[i].old_state =
+			crtc->funcs->atomic_duplicate_state(crtc);
+
+		if (!state->crtcs[i].state ||
+		    !state->crtcs[i].old_state) {
+			err = -ENOMEM;
+			goto free;
+		}
+
+		state->crtcs[i].old_state->state = state;
+	}
+
+	drm_connector_list_iter_begin(dev, &conn_iter);
+	drm_for_each_connector_iter(conn, &conn_iter) {
+		int i = drm_connector_index(conn);
+
+		err = drm_atomic_state_realloc_connectors(conn->dev, state, i);
+		if (err)
+			break;
+
+		drm_connector_get(conn);
+		state->connectors[i].ptr = conn;
+		state->connectors[i].state = conn->funcs->atomic_duplicate_state(conn);
+		state->connectors[i].new_state = state->connectors[i].state;
+		state->connectors[i].old_state = conn->funcs->atomic_duplicate_state(conn);
+
+		if (!state->connectors[i].state ||
+		    !state->connectors[i].old_state) {
+			err = -ENOMEM;
+			break;
+		}
+
+		state->connectors[i].old_state->state = state;
+	}
+	drm_connector_list_iter_end(&conn_iter);
+
+free:
+	if (err < 0) {
+		drm_atomic_helper_clean_committed_state(state);
+		drm_atomic_state_put(state);
+		state = ERR_PTR(err);
+	}
+
+	return state;
+}
+EXPORT_SYMBOL(drm_atomic_helper_duplicate_committed_state);
+
+void
+drm_atomic_helper_clean_committed_state(struct drm_atomic_state *state)
+{
+	struct drm_plane *plane;
+	struct drm_crtc *crtc;
+	struct drm_connector *conn;
+	struct drm_plane_state *plane_state;
+	struct drm_crtc_state *crtc_state;
+	struct drm_connector_state *conn_state;
+	int i;
+
+	/* restore the correct state->state */
+	for_each_new_plane_in_state(state, plane, plane_state, i) {
+		if (!state->planes[i].old_state)
+			continue;
+		plane->funcs->atomic_destroy_state(plane,
+						   state->planes[i].old_state);
+		state->planes[i].old_state = NULL;
+	}
+	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
+		if (!state->crtcs[i].old_state)
+			continue;
+		crtc->funcs->atomic_destroy_state(crtc,
+						  state->crtcs[i].old_state);
+		state->crtcs[i].old_state = NULL;
+	}
+	for_each_new_connector_in_state(state, conn, conn_state, i) {
+		if (!state->connectors[i].old_state)
+			continue;
+		conn->funcs->atomic_destroy_state(conn,
+						  state->connectors[i].old_state);
+		state->connectors[i].old_state = NULL;
+	}
+}
+EXPORT_SYMBOL(drm_atomic_helper_clean_committed_state);
+
 /**
  * __drm_atomic_helper_connector_destroy_state - release connector state
  * @state: connector state object to release
diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h
index 3bfeb2b2f746..70167c387501 100644
--- a/include/drm/drm_atomic_helper.h
+++ b/include/drm/drm_atomic_helper.h
@@ -173,6 +173,10 @@ drm_atomic_helper_connector_duplicate_state(struct drm_connector *connector);
 struct drm_atomic_state *
 drm_atomic_helper_duplicate_state(struct drm_device *dev,
 				  struct drm_modeset_acquire_ctx *ctx);
+struct drm_atomic_state *
+drm_atomic_helper_duplicate_committed_state(struct drm_device *dev);
+void
+drm_atomic_helper_clean_committed_state(struct drm_atomic_state *state);
 void
 __drm_atomic_helper_connector_destroy_state(struct drm_connector_state *state);
 void drm_atomic_helper_connector_destroy_state(struct drm_connector *connector,
-- 
2.13.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/5] drm/i915% Store vma gtt offset in plane state
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
  2017-06-29 13:49 ` [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors() ville.syrjala
  2017-06-29 13:49 ` [PATCH 2/5] drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state() ville.syrjala
@ 2017-06-29 13:49 ` ville.syrjala
  2017-06-29 13:49 ` [PATCH 4/5] drm/i915: Refactor __intel_atomic_commit_tail() ville.syrjala
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

To avoid having to deference plane_state->vma during the commit phase of
plane updates, let's store the vma gtt offset (or the bus address when
we need it) in the plane state. This is crucial for doing the modeset
operations during GPU reset as as plane_state->vma gets cleared when we
duplicate the state and we won't be calling .prepare_fb() during GPU
reset plane commits.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 38 +++++++++++++++++++++---------------
 drivers/gpu/drm/i915/intel_drv.h     |  6 +-----
 drivers/gpu/drm/i915/intel_sprite.c  |  8 ++++----
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 4e03ca6c946f..4ce81a694bd2 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2744,7 +2744,7 @@ intel_find_initial_plane_obj(struct intel_crtc *intel_crtc,
 		if (!state->vma)
 			continue;
 
-		if (intel_plane_ggtt_offset(state) == plane_config->base) {
+		if (state->gtt_offset == plane_config->base) {
 			fb = c->primary->fb;
 			drm_framebuffer_reference(fb);
 			goto valid_fb;
@@ -2771,6 +2771,8 @@ intel_find_initial_plane_obj(struct intel_crtc *intel_crtc,
 	mutex_lock(&dev->struct_mutex);
 	intel_state->vma =
 		intel_pin_and_fence_fb_obj(fb, primary->state->rotation);
+	intel_state->gtt_offset = i915_ggtt_offset(intel_state->vma);
+
 	mutex_unlock(&dev->struct_mutex);
 	if (IS_ERR(intel_state->vma)) {
 		DRM_ERROR("failed to pin boot fb on pipe %d: %li\n",
@@ -3122,19 +3124,16 @@ static void i9xx_update_primary_plane(struct intel_plane *primary,
 	I915_WRITE_FW(DSPSTRIDE(plane), fb->pitches[0]);
 	if (IS_HASWELL(dev_priv) || IS_BROADWELL(dev_priv)) {
 		I915_WRITE_FW(DSPSURF(plane),
-			      intel_plane_ggtt_offset(plane_state) +
-			      crtc->dspaddr_offset);
+			      plane_state->gtt_offset + crtc->dspaddr_offset);
 		I915_WRITE_FW(DSPOFFSET(plane), (y << 16) | x);
 	} else if (INTEL_GEN(dev_priv) >= 4) {
 		I915_WRITE_FW(DSPSURF(plane),
-			      intel_plane_ggtt_offset(plane_state) +
-			      crtc->dspaddr_offset);
+			      plane_state->gtt_offset + crtc->dspaddr_offset);
 		I915_WRITE_FW(DSPTILEOFF(plane), (y << 16) | x);
 		I915_WRITE_FW(DSPLINOFF(plane), linear_offset);
 	} else {
 		I915_WRITE_FW(DSPADDR(plane),
-			      intel_plane_ggtt_offset(plane_state) +
-			      crtc->dspaddr_offset);
+			      plane_state->gtt_offset + crtc->dspaddr_offset);
 	}
 	POSTING_READ_FW(reg);
 
@@ -3395,7 +3394,7 @@ static void skylake_update_primary_plane(struct intel_plane *plane,
 	}
 
 	I915_WRITE_FW(PLANE_SURF(pipe, plane_id),
-		      intel_plane_ggtt_offset(plane_state) + surf_addr);
+		      plane_state->gtt_offset + surf_addr);
 
 	POSTING_READ_FW(PLANE_SURF(pipe, plane_id));
 
@@ -9185,15 +9184,9 @@ static u32 intel_cursor_base(const struct intel_plane_state *plane_state)
 	struct drm_i915_private *dev_priv =
 		to_i915(plane_state->base.plane->dev);
 	const struct drm_framebuffer *fb = plane_state->base.fb;
-	const struct drm_i915_gem_object *obj = intel_fb_obj(fb);
 	u32 base;
 
-	if (INTEL_INFO(dev_priv)->cursor_needs_physical)
-		base = obj->phys_handle->busaddr;
-	else
-		base = intel_plane_ggtt_offset(plane_state);
-
-	base += plane_state->main.offset;
+	base = plane_state->gtt_offset + plane_state->main.offset;
 
 	/* ILK+ do this automagically */
 	if (HAS_GMCH_DISPLAY(dev_priv) &&
@@ -10846,8 +10839,11 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
 
 	work->old_vma = to_intel_plane_state(primary->state)->vma;
 	to_intel_plane_state(primary->state)->vma = vma;
+	to_intel_plane_state(primary->state)->gtt_offset =
+		i915_ggtt_offset(vma);
 
-	work->gtt_offset = i915_ggtt_offset(vma) + intel_crtc->dspaddr_offset;
+	work->gtt_offset = to_intel_plane_state(primary->state)->gtt_offset +
+		intel_crtc->dspaddr_offset;
 	work->rotation = crtc->primary->state->rotation;
 
 	/*
@@ -10903,6 +10899,8 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
 	i915_add_request(request);
 cleanup_unpin:
 	to_intel_plane_state(primary->state)->vma = work->old_vma;
+	to_intel_plane_state(primary->state)->gtt_offset =
+		i915_ggtt_offset(work->old_vma);
 	intel_unpin_fb_vma(vma);
 cleanup_pending:
 	atomic_dec(&intel_crtc->unpin_work_count);
@@ -13344,6 +13342,8 @@ intel_prepare_plane_fb(struct drm_plane *plane,
 				DRM_DEBUG_KMS("failed to attach phys object\n");
 				return ret;
 			}
+			to_intel_plane_state(new_state)->gtt_offset =
+				obj->phys_handle->busaddr;
 		} else {
 			struct i915_vma *vma;
 
@@ -13354,6 +13354,8 @@ intel_prepare_plane_fb(struct drm_plane *plane,
 			}
 
 			to_intel_plane_state(new_state)->vma = vma;
+			to_intel_plane_state(new_state)->gtt_offset =
+				i915_ggtt_offset(vma);
 		}
 	}
 
@@ -13657,6 +13659,8 @@ intel_legacy_cursor_update(struct drm_plane *plane,
 			DRM_DEBUG_KMS("failed to attach phys object\n");
 			goto out_unlock;
 		}
+		to_intel_plane_state(new_plane_state)->gtt_offset =
+			intel_fb_obj(fb)->phys_handle->busaddr;
 	} else {
 		struct i915_vma *vma;
 
@@ -13669,6 +13673,8 @@ intel_legacy_cursor_update(struct drm_plane *plane,
 		}
 
 		to_intel_plane_state(new_plane_state)->vma = vma;
+		to_intel_plane_state(new_plane_state)->gtt_offset =
+			i915_ggtt_offset(vma);
 	}
 
 	old_fb = old_plane_state->fb;
diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
index d17a32437f07..67efbc7de559 100644
--- a/drivers/gpu/drm/i915/intel_drv.h
+++ b/drivers/gpu/drm/i915/intel_drv.h
@@ -405,6 +405,7 @@ struct intel_plane_state {
 	struct drm_plane_state base;
 	struct drm_rect clip;
 	struct i915_vma *vma;
+	u32 gtt_offset; /* GGTT offset, or bus address */
 
 	struct {
 		u32 offset;
@@ -1482,11 +1483,6 @@ void intel_mode_from_pipe_config(struct drm_display_mode *mode,
 int skl_update_scaler_crtc(struct intel_crtc_state *crtc_state);
 int skl_max_scale(struct intel_crtc *crtc, struct intel_crtc_state *crtc_state);
 
-static inline u32 intel_plane_ggtt_offset(const struct intel_plane_state *state)
-{
-	return i915_ggtt_offset(state->vma);
-}
-
 u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state,
 		  const struct intel_plane_state *plane_state);
 u32 skl_plane_stride(const struct drm_framebuffer *fb, int plane,
diff --git a/drivers/gpu/drm/i915/intel_sprite.c b/drivers/gpu/drm/i915/intel_sprite.c
index 0c650c2cbca8..eced0772ed03 100644
--- a/drivers/gpu/drm/i915/intel_sprite.c
+++ b/drivers/gpu/drm/i915/intel_sprite.c
@@ -300,7 +300,7 @@ skl_update_plane(struct intel_plane *plane,
 
 	I915_WRITE_FW(PLANE_CTL(pipe, plane_id), plane_ctl);
 	I915_WRITE_FW(PLANE_SURF(pipe, plane_id),
-		      intel_plane_ggtt_offset(plane_state) + surf_addr);
+		      plane_state->gtt_offset + surf_addr);
 	POSTING_READ_FW(PLANE_SURF(pipe, plane_id));
 
 	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
@@ -477,7 +477,7 @@ vlv_update_plane(struct intel_plane *plane,
 	I915_WRITE_FW(SPSIZE(pipe, plane_id), (crtc_h << 16) | crtc_w);
 	I915_WRITE_FW(SPCNTR(pipe, plane_id), sprctl);
 	I915_WRITE_FW(SPSURF(pipe, plane_id),
-		      intel_plane_ggtt_offset(plane_state) + sprsurf_offset);
+		      plane_state->gtt_offset + sprsurf_offset);
 	POSTING_READ_FW(SPSURF(pipe, plane_id));
 
 	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
@@ -615,7 +615,7 @@ ivb_update_plane(struct intel_plane *plane,
 		I915_WRITE_FW(SPRSCALE(pipe), sprscale);
 	I915_WRITE_FW(SPRCTL(pipe), sprctl);
 	I915_WRITE_FW(SPRSURF(pipe),
-		      intel_plane_ggtt_offset(plane_state) + sprsurf_offset);
+		      plane_state->gtt_offset + sprsurf_offset);
 	POSTING_READ_FW(SPRSURF(pipe));
 
 	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
@@ -747,7 +747,7 @@ g4x_update_plane(struct intel_plane *plane,
 	I915_WRITE_FW(DVSSCALE(pipe), dvsscale);
 	I915_WRITE_FW(DVSCNTR(pipe), dvscntr);
 	I915_WRITE_FW(DVSSURF(pipe),
-		      intel_plane_ggtt_offset(plane_state) + dvssurf_offset);
+		      plane_state->gtt_offset + dvssurf_offset);
 	POSTING_READ_FW(DVSSURF(pipe));
 
 	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
-- 
2.13.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/5] drm/i915: Refactor __intel_atomic_commit_tail()
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
                   ` (2 preceding siblings ...)
  2017-06-29 13:49 ` [PATCH 3/5] drm/i915% Store vma gtt offset in plane state ville.syrjala
@ 2017-06-29 13:49 ` ville.syrjala
  2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Split intel_atomic_commit_tail() into a lower level function that does
the actual commit, and a higher level one that waits for the
dependencies and signals the commit as done. We'll reuse the lower
level function to perform commits during GPU resets.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 4ce81a694bd2..7cdd6ec97f80 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -13004,7 +13004,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
-static void intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13017,8 +13017,6 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 	unsigned crtc_vblank_mask = 0;
 	int i;
 
-	drm_atomic_helper_wait_for_dependencies(state);
-
 	if (intel_state->modeset)
 		intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
 
@@ -13143,8 +13141,6 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 	if (intel_state->modeset && intel_can_enable_sagv(state))
 		intel_enable_sagv(dev_priv);
 
-	drm_atomic_helper_commit_hw_done(state);
-
 	if (intel_state->modeset) {
 		/* As one of the primary mmio accessors, KMS has a high
 		 * likelihood of triggering bugs in unclaimed access. After we
@@ -13155,6 +13151,18 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 		intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
 		intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET);
 	}
+}
+
+static void intel_atomic_commit_tail(struct drm_atomic_state *state)
+{
+	struct drm_device *dev = state->dev;
+	struct drm_i915_private *dev_priv = to_i915(dev);
+
+	drm_atomic_helper_wait_for_dependencies(state);
+
+	__intel_atomic_commit_tail(state);
+
+	drm_atomic_helper_commit_hw_done(state);
 
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
-- 
2.13.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
                   ` (3 preceding siblings ...)
  2017-06-29 13:49 ` [PATCH 4/5] drm/i915: Refactor __intel_atomic_commit_tail() ville.syrjala
@ 2017-06-29 13:49 ` ville.syrjala
  2017-06-29 13:56   ` Chris Wilson
                     ` (2 more replies)
  2017-06-29 14:24 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again Patchwork
  2017-06-29 14:55 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again (rev2) Patchwork
  6 siblings, 3 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 13:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, stable, dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Introduce an rw_semaphore to protect the display commits. All normal
commits use down_read() and hence can proceed in parallel, but GPU reset
will use down_write() making sure no other commits are in progress when
we have to pull the plug on the display engine on pre-g4x platforms.
There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
itself, and we wait for all dependencies before the down_read(), and
thus there is no chance of deadlocks with this scheme.

During reset we should be recommiting the state that was committed last.
But for now we'll settle for recommiting the last state for each object.
Hence we may commit things a bit too soon when a GPU reset occurs. The
rw_semaphore should guarantee that whatever state we observe in
obj->state during reset sticks around while we do the commit. The
obj->state pointer might change for some objects if another swap_state()
occurs while we do our thing, so there migth be some theoretical
mismatch which I tink we could avoid by grabbing the rw_semaphore also
around the swap_state(), but for now I didn't do that.

Another source of mismatch can happen because we sometimes use the
intel_crtc->config during the actual commit, and that only gets updated
when we do the commuit. Hence we may get some state via ->state, some
via the duplicated ->state, and some via ->config. We'll want to
fix this all by tracking the commited state properly, but that will
some bigger refactroing so for now we'll just choose to accept these
potential mismatches.

I left out the state readout from the post-reset display
reinitialization as that still likes to clobber crtc->state etc.
If we make it use a free standing atomic state and mke sure it doesn't
need any locks we could reintroduce it. For now I just mark the
post-reset display state as dirty as possible to make sure we
reinitialize everything.

One other potential issue remains in the form of display detection.
Either we need to protect that with the same rw_semaphore as well, or
perhaps it would be enough to force gmbus into bitbanging mode while
the reset is happening and we don't have interrupts, and just across
the actual hardware GPU reset we could hold the gmbus mutex.

v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
v3: Drop all the committed_state refactoring to make this less
    obnoxious to backport (Daniel)

Cc: stable@vger.kernel.org # for the brave
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/i915_irq.c      |  44 +-------
 drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
 3 files changed, 139 insertions(+), 106 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index effbe4f72a64..88ddd27f61b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2237,6 +2237,8 @@ struct drm_i915_private {
 	struct drm_atomic_state *modeset_restore_state;
 	struct drm_modeset_acquire_ctx reset_ctx;
 
+	struct rw_semaphore commit_sem;
+
 	struct list_head vm_list; /* Global list of all address spaces */
 	struct i915_ggtt ggtt; /* VM representing the global address space */
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index eb4f1dca2077..9d591f17fda3 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
 	return ret;
 }
 
-struct wedge_me {
-	struct delayed_work work;
-	struct drm_i915_private *i915;
-	const char *name;
-};
-
-static void wedge_me(struct work_struct *work)
-{
-	struct wedge_me *w = container_of(work, typeof(*w), work.work);
-
-	dev_err(w->i915->drm.dev,
-		"%s timed out, cancelling all in-flight rendering.\n",
-		w->name);
-	i915_gem_set_wedged(w->i915);
-}
-
-static void __init_wedge(struct wedge_me *w,
-			 struct drm_i915_private *i915,
-			 long timeout,
-			 const char *name)
-{
-	w->i915 = i915;
-	w->name = name;
-
-	INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
-	schedule_delayed_work(&w->work, timeout);
-}
-
-static void __fini_wedge(struct wedge_me *w)
-{
-	cancel_delayed_work_sync(&w->work);
-	destroy_delayed_work_on_stack(&w->work);
-	w->i915 = NULL;
-}
-
-#define i915_wedge_on_timeout(W, DEV, TIMEOUT)				\
-	for (__init_wedge((W), (DEV), (TIMEOUT), __func__);		\
-	     (W)->i915;							\
-	     __fini_wedge((W)))
-
 /**
  * i915_reset_device - do process context error handling work
  * @dev_priv: i915 device private
@@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
-	struct wedge_me w;
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
 	DRM_DEBUG_DRIVER("resetting chip\n");
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
-	/* Use a watchdog to ensure that our reset completes */
-	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
+	{
 		intel_prepare_reset(dev_priv);
 
 		/* Signal that locked waiters should reset the GPU */
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 7cdd6ec97f80..f13c7d81d4a9 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
 static void intel_modeset_setup_hw_state(struct drm_device *dev,
 					 struct drm_modeset_acquire_ctx *ctx);
 static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
 
 struct intel_limit {
 	struct {
@@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
 }
 
-void intel_prepare_reset(struct drm_i915_private *dev_priv)
+static void init_intel_state(struct intel_atomic_state *state)
+{
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
+	int i;
+
+	state->modeset = true;
+
+	for_each_oldnew_crtc_in_state(&state->base, crtc,
+				      old_crtc_state, new_crtc_state, i) {
+		if (new_crtc_state->active)
+			state->active_crtcs |= 1 << i;
+		else
+			state->active_crtcs &= ~(1 << i);
+
+		if (old_crtc_state->active != new_crtc_state->active)
+			state->active_pipe_changes |= drm_crtc_mask(crtc);
+	}
+}
+
+static struct drm_atomic_state *
+intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
 {
-	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
 	struct drm_atomic_state *state;
-	int ret;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	int i;
 
-	/*
-	 * Need mode_config.mutex so that we don't
-	 * trample ongoing ->detect() and whatnot.
-	 */
-	mutex_lock(&dev->mode_config.mutex);
-	drm_modeset_acquire_init(ctx, 0);
-	while (1) {
-		ret = drm_modeset_lock_all_ctx(dev, ctx);
-		if (ret != -EDEADLK)
-			break;
+	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
+	if (IS_ERR(state)) {
+		DRM_ERROR("Duplicating state failed with %ld\n",
+			  PTR_ERR(state));
+		return NULL;
+	}
+
+	to_intel_atomic_state(state)->active_crtcs = 0;
+	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
+	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
+
+	init_intel_state(to_intel_atomic_state(state));
+
+	/* force a full update */
+	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
+		struct intel_crtc_state *intel_crtc_state =
+			to_intel_crtc_state(crtc_state);
+
+		if (!crtc_state->active)
+			continue;
 
-		drm_modeset_backoff(ctx);
+		crtc_state->mode_changed = true;
+		crtc_state->active_changed = true;
+		crtc_state->planes_changed = true;
+		crtc_state->connectors_changed = true;
+		crtc_state->color_mgmt_changed = true;
+		crtc_state->zpos_changed = true;
+
+		intel_crtc_state->update_pipe = true;
+		intel_crtc_state->disable_lp_wm = true;
+		intel_crtc_state->disable_cxsr = true;
+		intel_crtc_state->update_wm_post = true;
+		intel_crtc_state->fb_changed = true;
+		intel_crtc_state->fifo_changed = true;
+		intel_crtc_state->wm.need_postvbl_update = true;
 	}
 
+	return state;
+}
+
+void intel_prepare_reset(struct drm_i915_private *dev_priv)
+{
+	struct drm_atomic_state *disable_state, *restore_state = NULL;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	struct drm_plane *plane;
+	struct drm_plane_state *plane_state;
+	int i;
+
+	down_write(&dev_priv->commit_sem);
+
 	/* reset doesn't touch the display, but flips might get nuked anyway, */
 	if (!i915.force_reset_modeset_test &&
 	    !gpu_reset_clobbers_display(dev_priv))
@@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	 * Disabling the crtcs gracefully seems nicer. Also the
 	 * g33 docs say we should at least disable all the planes.
 	 */
-	state = drm_atomic_helper_duplicate_state(dev, ctx);
-	if (IS_ERR(state)) {
-		ret = PTR_ERR(state);
-		DRM_ERROR("Duplicating state failed with %i\n", ret);
-		return;
-	}
+	disable_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(disable_state))
+		goto out;
 
-	ret = drm_atomic_helper_disable_all(dev, ctx);
-	if (ret) {
-		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
-		drm_atomic_state_put(state);
-		return;
-	}
+	to_intel_atomic_state(disable_state)->active_crtcs = 0;
 
-	dev_priv->modeset_restore_state = state;
-	state->acquire_ctx = ctx;
+	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+	__intel_atomic_commit_tail(disable_state, true);
+
+	drm_atomic_helper_clean_committed_state(disable_state);
+	drm_atomic_state_put(disable_state);
+
+	restore_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(restore_state))
+		restore_state = NULL;
+
+	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+out:
+	dev_priv->modeset_restore_state = restore_state;
 }
 
 void intel_finish_reset(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
-	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
-	int ret;
+	struct drm_atomic_state *restore_state =
+		dev_priv->modeset_restore_state;
 
 	/*
 	 * Flips in the rings will be nuked by the reset,
@@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 
 	/* reset doesn't touch the display */
 	if (!gpu_reset_clobbers_display(dev_priv)) {
-		if (!state) {
+		if (!restore_state) {
 			/*
 			 * Flips in the rings have been nuked by the reset,
 			 * so update the base address of all primary
@@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			 */
 			intel_update_primary_planes(dev);
 		} else {
-			ret = __intel_display_resume(dev, state, ctx);
-			if (ret)
-				DRM_ERROR("Restoring old state failed with %i\n", ret);
+			__intel_atomic_commit_tail(restore_state, true);
 		}
 	} else {
+		i915_redisable_vga(dev_priv);
+
 		/*
 		 * The display has been reset as well,
 		 * so need a full re-initialization.
@@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			dev_priv->display.hpd_irq_setup(dev_priv);
 		spin_unlock_irq(&dev_priv->irq_lock);
 
-		ret = __intel_display_resume(dev, state, ctx);
-		if (ret)
-			DRM_ERROR("Restoring old state failed with %i\n", ret);
+		__intel_atomic_commit_tail(restore_state, true);
 
 		intel_hpd_init(dev_priv);
 	}
 
-	if (state)
-		drm_atomic_state_put(state);
-	drm_modeset_drop_locks(ctx);
-	drm_modeset_acquire_fini(ctx);
-	mutex_unlock(&dev->mode_config.mutex);
+	if (restore_state) {
+		drm_atomic_helper_clean_committed_state(restore_state);
+		drm_atomic_state_put(restore_state);
+	}
+
+	up_write(&dev_priv->commit_sem);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
 {
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
 	struct drm_i915_private *dev_priv = to_i915(state->dev);
-	struct drm_crtc *crtc;
-	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-	int ret = 0, i;
+	int ret = 0;
 
 	if (!check_digital_port_conflicts(state)) {
 		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
 		return -EINVAL;
 	}
 
-	intel_state->modeset = true;
 	intel_state->active_crtcs = dev_priv->active_crtcs;
 	intel_state->cdclk.logical = dev_priv->cdclk.logical;
 	intel_state->cdclk.actual = dev_priv->cdclk.actual;
 
-	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
-		if (new_crtc_state->active)
-			intel_state->active_crtcs |= 1 << i;
-		else
-			intel_state->active_crtcs &= ~(1 << i);
-
-		if (old_crtc_state->active != new_crtc_state->active)
-			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
-	}
+	init_intel_state(intel_state);
 
 	/*
 	 * See if the config requires any additional preparation, e.g.
@@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	/* Only after disabling all output pipelines that will be changed can we
 	 * update the the output configuration. */
-	intel_modeset_update_crtc_state(state);
+	if (!is_reset)
+		intel_modeset_update_crtc_state(state);
 
 	if (intel_state->modeset) {
-		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		if (!is_reset) {
+			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		} else {
+			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
+				if (new_crtc_state->enable)
+					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
+			}
+		}
 
 		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
 
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (!intel_can_enable_sagv(state))
 			intel_disable_sagv(dev_priv);
 
-		intel_modeset_verify_disabled(dev, state);
+		if (!is_reset)
+			intel_modeset_verify_disabled(dev, state);
 	}
 
 	/* Complete the events for pipes that have now been disabled */
@@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (put_domains[i])
 			modeset_put_power_domains(dev_priv, put_domains[i]);
 
-		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
+		if (!is_reset)
+			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
 	}
 
 	if (intel_state->modeset && intel_can_enable_sagv(state))
@@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	drm_atomic_helper_wait_for_dependencies(state);
 
-	__intel_atomic_commit_tail(state);
+	down_read(&dev_priv->commit_sem);
+
+	__intel_atomic_commit_tail(state, false);
 
 	drm_atomic_helper_commit_hw_done(state);
 
+	up_read(&dev_priv->commit_sem);
+
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	mutex_unlock(&dev->struct_mutex);
@@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
 
+	init_rwsem(&dev_priv->commit_sem);
+
 	intel_init_quirks(dev);
 
 	intel_init_pm(dev_priv);
-- 
2.13.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-29 13:56   ` Chris Wilson
  2017-06-29 14:36     ` ville.syrjala
  2017-06-30 13:25     ` Daniel Vetter
  2 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2017-06-29 13:56 UTC (permalink / raw)
  To: ville.syrjala, intel-gfx; +Cc: Daniel Vetter, stable, dri-devel

Quoting ville.syrjala@linux.intel.com (2017-06-29 14:49:48)
> @@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
>         char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>         char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>         char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -       struct wedge_me w;
>  
>         kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
>         DRM_DEBUG_DRIVER("resetting chip\n");
>         kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> -       /* Use a watchdog to ensure that our reset completes */
> -       i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {

Keep the wedge-if-reset times out mechanism. It is a nice safe guard
against driver misbehaviour and not limited to breaking the previous
deadlock. If it fires, we get an error in dmesg to go along with the
lost data. I quickly grew fond of having this safe guard!
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
                   ` (4 preceding siblings ...)
  2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-29 14:24 ` Patchwork
  2017-06-29 15:07   ` Ville Syrjälä
  2017-06-29 14:55 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again (rev2) Patchwork
  6 siblings, 1 reply; 44+ messages in thread
From: Patchwork @ 2017-06-29 14:24 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Fix pre-g4x GPU reset, again
URL   : https://patchwork.freedesktop.org/series/26554/
State : success

== Summary ==

Series 26554v1 drm/i915: Fix pre-g4x GPU reset, again
https://patchwork.freedesktop.org/api/1.0/series/26554/revisions/1/mbox/

Test gem_exec_suspend:
        Subgroup basic-s4-devices:
                pass       -> DMESG-WARN (fi-kbl-7560u) fdo#100125
Test gem_ringfill:
        Subgroup basic-default-hang:
                dmesg-warn -> PASS       (fi-blb-e6850) fdo#101600 +1
Test kms_cursor_legacy:
        Subgroup basic-busy-flip-before-cursor-atomic:
                fail       -> PASS       (fi-snb-2600) fdo#100215
Test kms_pipe_crc_basic:
        Subgroup hang-read-crc-pipe-b:
                pass       -> DMESG-WARN (fi-pnv-d510) fdo#101597
        Subgroup suspend-read-crc-pipe-b:
                dmesg-warn -> PASS       (fi-byt-n2820) fdo#101517

fdo#100125 https://bugs.freedesktop.org/show_bug.cgi?id=100125
fdo#101600 https://bugs.freedesktop.org/show_bug.cgi?id=101600
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#101597 https://bugs.freedesktop.org/show_bug.cgi?id=101597
fdo#101517 https://bugs.freedesktop.org/show_bug.cgi?id=101517

fi-bdw-5557u     total:279  pass:264  dwarn:4   dfail:0   fail:0   skip:11  time:448s
fi-bdw-gvtdvm    total:279  pass:257  dwarn:8   dfail:0   fail:0   skip:14  time:429s
fi-blb-e6850     total:279  pass:225  dwarn:0   dfail:0   fail:0   skip:54  time:354s
fi-bsw-n3050     total:279  pass:243  dwarn:0   dfail:0   fail:0   skip:36  time:530s
fi-bxt-j4205     total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:512s
fi-byt-j1900     total:279  pass:254  dwarn:1   dfail:0   fail:0   skip:24  time:485s
fi-byt-n2820     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:495s
fi-glk-2a        total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:602s
fi-hsw-4770      total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:433s
fi-hsw-4770r     total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:415s
fi-ilk-650       total:279  pass:229  dwarn:0   dfail:0   fail:0   skip:50  time:422s
fi-ivb-3520m     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:492s
fi-ivb-3770      total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:476s
fi-kbl-7500u     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:465s
fi-kbl-7560u     total:279  pass:268  dwarn:1   dfail:0   fail:0   skip:10  time:570s
fi-kbl-r         total:279  pass:260  dwarn:1   dfail:0   fail:0   skip:18  time:581s
fi-pnv-d510      total:279  pass:222  dwarn:2   dfail:0   fail:0   skip:55  time:554s
fi-skl-6260u     total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:461s
fi-skl-6700hq    total:279  pass:223  dwarn:1   dfail:0   fail:30  skip:24  time:340s
fi-skl-6700k     total:279  pass:257  dwarn:4   dfail:0   fail:0   skip:18  time:463s
fi-skl-6770hq    total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:476s
fi-skl-gvtdvm    total:279  pass:266  dwarn:0   dfail:0   fail:0   skip:13  time:437s
fi-snb-2520m     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:541s
fi-snb-2600      total:279  pass:250  dwarn:0   dfail:0   fail:0   skip:29  time:405s

ee2da0ac24fb8d50a03b263eb1fa2e82849eda95 drm-tip: 2017y-06m-29d-13h-14m-49s UTC integration manifest
d37cc43 drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
e6e3d92 drm/i915: Refactor __intel_atomic_commit_tail()
c33eea0 drm/i915% Store vma gtt offset in plane state
5ab92c4 drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
988bc43 drm/atomic: Refactor drm_atomic_state_realloc_connectors()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_5069/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-29 14:36     ` ville.syrjala
  2017-06-29 14:36     ` ville.syrjala
  2017-06-30 13:25     ` Daniel Vetter
  2 siblings, 0 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 14:36 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, stable, Daniel Vetter, Chris Wilson

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Introduce an rw_semaphore to protect the display commits. All normal
commits use down_read() and hence can proceed in parallel, but GPU reset
will use down_write() making sure no other commits are in progress when
we have to pull the plug on the display engine on pre-g4x platforms.
There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
itself, and we wait for all dependencies before the down_read(), and
thus there is no chance of deadlocks with this scheme.

During reset we should be recommiting the state that was committed last.
But for now we'll settle for recommiting the last state for each object.
Hence we may commit things a bit too soon when a GPU reset occurs. The
rw_semaphore should guarantee that whatever state we observe in
obj->state during reset sticks around while we do the commit. The
obj->state pointer might change for some objects if another swap_state()
occurs while we do our thing, so there migth be some theoretical
mismatch which I tink we could avoid by grabbing the rw_semaphore also
around the swap_state(), but for now I didn't do that.

Another source of mismatch can happen because we sometimes use the
intel_crtc->config during the actual commit, and that only gets updated
when we do the commuit. Hence we may get some state via ->state, some
via the duplicated ->state, and some via ->config. We'll want to
fix this all by tracking the commited state properly, but that will
some bigger refactroing so for now we'll just choose to accept these
potential mismatches.

I left out the state readout from the post-reset display
reinitialization as that still likes to clobber crtc->state etc.
If we make it use a free standing atomic state and mke sure it doesn't
need any locks we could reintroduce it. For now I just mark the
post-reset display state as dirty as possible to make sure we
reinitialize everything.

One other potential issue remains in the form of display detection.
Either we need to protect that with the same rw_semaphore as well, or
perhaps it would be enough to force gmbus into bitbanging mode while
the reset is happening and we don't have interrupts, and just across
the actual hardware GPU reset we could hold the gmbus mutex.

v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
v3: Drop all the committed_state refactoring to make this less
    obnoxious to backport (Daniel)
v4: Preserve the wedge timeout mechanism (Chris)

Cc: <stable@vger.kernel.org> # for the brave
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
 2 files changed, 138 insertions(+), 63 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index effbe4f72a64..88ddd27f61b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2237,6 +2237,8 @@ struct drm_i915_private {
 	struct drm_atomic_state *modeset_restore_state;
 	struct drm_modeset_acquire_ctx reset_ctx;
 
+	struct rw_semaphore commit_sem;
+
 	struct list_head vm_list; /* Global list of all address spaces */
 	struct i915_ggtt ggtt; /* VM representing the global address space */
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 7cdd6ec97f80..f13c7d81d4a9 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
 static void intel_modeset_setup_hw_state(struct drm_device *dev,
 					 struct drm_modeset_acquire_ctx *ctx);
 static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
 
 struct intel_limit {
 	struct {
@@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
 }
 
-void intel_prepare_reset(struct drm_i915_private *dev_priv)
+static void init_intel_state(struct intel_atomic_state *state)
+{
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
+	int i;
+
+	state->modeset = true;
+
+	for_each_oldnew_crtc_in_state(&state->base, crtc,
+				      old_crtc_state, new_crtc_state, i) {
+		if (new_crtc_state->active)
+			state->active_crtcs |= 1 << i;
+		else
+			state->active_crtcs &= ~(1 << i);
+
+		if (old_crtc_state->active != new_crtc_state->active)
+			state->active_pipe_changes |= drm_crtc_mask(crtc);
+	}
+}
+
+static struct drm_atomic_state *
+intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
 {
-	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
 	struct drm_atomic_state *state;
-	int ret;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	int i;
 
-	/*
-	 * Need mode_config.mutex so that we don't
-	 * trample ongoing ->detect() and whatnot.
-	 */
-	mutex_lock(&dev->mode_config.mutex);
-	drm_modeset_acquire_init(ctx, 0);
-	while (1) {
-		ret = drm_modeset_lock_all_ctx(dev, ctx);
-		if (ret != -EDEADLK)
-			break;
+	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
+	if (IS_ERR(state)) {
+		DRM_ERROR("Duplicating state failed with %ld\n",
+			  PTR_ERR(state));
+		return NULL;
+	}
+
+	to_intel_atomic_state(state)->active_crtcs = 0;
+	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
+	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
+
+	init_intel_state(to_intel_atomic_state(state));
+
+	/* force a full update */
+	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
+		struct intel_crtc_state *intel_crtc_state =
+			to_intel_crtc_state(crtc_state);
+
+		if (!crtc_state->active)
+			continue;
 
-		drm_modeset_backoff(ctx);
+		crtc_state->mode_changed = true;
+		crtc_state->active_changed = true;
+		crtc_state->planes_changed = true;
+		crtc_state->connectors_changed = true;
+		crtc_state->color_mgmt_changed = true;
+		crtc_state->zpos_changed = true;
+
+		intel_crtc_state->update_pipe = true;
+		intel_crtc_state->disable_lp_wm = true;
+		intel_crtc_state->disable_cxsr = true;
+		intel_crtc_state->update_wm_post = true;
+		intel_crtc_state->fb_changed = true;
+		intel_crtc_state->fifo_changed = true;
+		intel_crtc_state->wm.need_postvbl_update = true;
 	}
 
+	return state;
+}
+
+void intel_prepare_reset(struct drm_i915_private *dev_priv)
+{
+	struct drm_atomic_state *disable_state, *restore_state = NULL;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	struct drm_plane *plane;
+	struct drm_plane_state *plane_state;
+	int i;
+
+	down_write(&dev_priv->commit_sem);
+
 	/* reset doesn't touch the display, but flips might get nuked anyway, */
 	if (!i915.force_reset_modeset_test &&
 	    !gpu_reset_clobbers_display(dev_priv))
@@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	 * Disabling the crtcs gracefully seems nicer. Also the
 	 * g33 docs say we should at least disable all the planes.
 	 */
-	state = drm_atomic_helper_duplicate_state(dev, ctx);
-	if (IS_ERR(state)) {
-		ret = PTR_ERR(state);
-		DRM_ERROR("Duplicating state failed with %i\n", ret);
-		return;
-	}
+	disable_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(disable_state))
+		goto out;
 
-	ret = drm_atomic_helper_disable_all(dev, ctx);
-	if (ret) {
-		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
-		drm_atomic_state_put(state);
-		return;
-	}
+	to_intel_atomic_state(disable_state)->active_crtcs = 0;
 
-	dev_priv->modeset_restore_state = state;
-	state->acquire_ctx = ctx;
+	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+	__intel_atomic_commit_tail(disable_state, true);
+
+	drm_atomic_helper_clean_committed_state(disable_state);
+	drm_atomic_state_put(disable_state);
+
+	restore_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(restore_state))
+		restore_state = NULL;
+
+	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+out:
+	dev_priv->modeset_restore_state = restore_state;
 }
 
 void intel_finish_reset(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
-	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
-	int ret;
+	struct drm_atomic_state *restore_state =
+		dev_priv->modeset_restore_state;
 
 	/*
 	 * Flips in the rings will be nuked by the reset,
@@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 
 	/* reset doesn't touch the display */
 	if (!gpu_reset_clobbers_display(dev_priv)) {
-		if (!state) {
+		if (!restore_state) {
 			/*
 			 * Flips in the rings have been nuked by the reset,
 			 * so update the base address of all primary
@@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			 */
 			intel_update_primary_planes(dev);
 		} else {
-			ret = __intel_display_resume(dev, state, ctx);
-			if (ret)
-				DRM_ERROR("Restoring old state failed with %i\n", ret);
+			__intel_atomic_commit_tail(restore_state, true);
 		}
 	} else {
+		i915_redisable_vga(dev_priv);
+
 		/*
 		 * The display has been reset as well,
 		 * so need a full re-initialization.
@@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			dev_priv->display.hpd_irq_setup(dev_priv);
 		spin_unlock_irq(&dev_priv->irq_lock);
 
-		ret = __intel_display_resume(dev, state, ctx);
-		if (ret)
-			DRM_ERROR("Restoring old state failed with %i\n", ret);
+		__intel_atomic_commit_tail(restore_state, true);
 
 		intel_hpd_init(dev_priv);
 	}
 
-	if (state)
-		drm_atomic_state_put(state);
-	drm_modeset_drop_locks(ctx);
-	drm_modeset_acquire_fini(ctx);
-	mutex_unlock(&dev->mode_config.mutex);
+	if (restore_state) {
+		drm_atomic_helper_clean_committed_state(restore_state);
+		drm_atomic_state_put(restore_state);
+	}
+
+	up_write(&dev_priv->commit_sem);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
 {
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
 	struct drm_i915_private *dev_priv = to_i915(state->dev);
-	struct drm_crtc *crtc;
-	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-	int ret = 0, i;
+	int ret = 0;
 
 	if (!check_digital_port_conflicts(state)) {
 		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
 		return -EINVAL;
 	}
 
-	intel_state->modeset = true;
 	intel_state->active_crtcs = dev_priv->active_crtcs;
 	intel_state->cdclk.logical = dev_priv->cdclk.logical;
 	intel_state->cdclk.actual = dev_priv->cdclk.actual;
 
-	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
-		if (new_crtc_state->active)
-			intel_state->active_crtcs |= 1 << i;
-		else
-			intel_state->active_crtcs &= ~(1 << i);
-
-		if (old_crtc_state->active != new_crtc_state->active)
-			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
-	}
+	init_intel_state(intel_state);
 
 	/*
 	 * See if the config requires any additional preparation, e.g.
@@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	/* Only after disabling all output pipelines that will be changed can we
 	 * update the the output configuration. */
-	intel_modeset_update_crtc_state(state);
+	if (!is_reset)
+		intel_modeset_update_crtc_state(state);
 
 	if (intel_state->modeset) {
-		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		if (!is_reset) {
+			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		} else {
+			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
+				if (new_crtc_state->enable)
+					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
+			}
+		}
 
 		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
 
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (!intel_can_enable_sagv(state))
 			intel_disable_sagv(dev_priv);
 
-		intel_modeset_verify_disabled(dev, state);
+		if (!is_reset)
+			intel_modeset_verify_disabled(dev, state);
 	}
 
 	/* Complete the events for pipes that have now been disabled */
@@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (put_domains[i])
 			modeset_put_power_domains(dev_priv, put_domains[i]);
 
-		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
+		if (!is_reset)
+			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
 	}
 
 	if (intel_state->modeset && intel_can_enable_sagv(state))
@@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	drm_atomic_helper_wait_for_dependencies(state);
 
-	__intel_atomic_commit_tail(state);
+	down_read(&dev_priv->commit_sem);
+
+	__intel_atomic_commit_tail(state, false);
 
 	drm_atomic_helper_commit_hw_done(state);
 
+	up_read(&dev_priv->commit_sem);
+
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	mutex_unlock(&dev->struct_mutex);
@@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
 
+	init_rwsem(&dev_priv->commit_sem);
+
 	intel_init_quirks(dev);
 
 	intel_init_pm(dev_priv);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-29 14:36     ` ville.syrjala
  0 siblings, 0 replies; 44+ messages in thread
From: ville.syrjala @ 2017-06-29 14:36 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, stable, dri-devel

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Introduce an rw_semaphore to protect the display commits. All normal
commits use down_read() and hence can proceed in parallel, but GPU reset
will use down_write() making sure no other commits are in progress when
we have to pull the plug on the display engine on pre-g4x platforms.
There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
itself, and we wait for all dependencies before the down_read(), and
thus there is no chance of deadlocks with this scheme.

During reset we should be recommiting the state that was committed last.
But for now we'll settle for recommiting the last state for each object.
Hence we may commit things a bit too soon when a GPU reset occurs. The
rw_semaphore should guarantee that whatever state we observe in
obj->state during reset sticks around while we do the commit. The
obj->state pointer might change for some objects if another swap_state()
occurs while we do our thing, so there migth be some theoretical
mismatch which I tink we could avoid by grabbing the rw_semaphore also
around the swap_state(), but for now I didn't do that.

Another source of mismatch can happen because we sometimes use the
intel_crtc->config during the actual commit, and that only gets updated
when we do the commuit. Hence we may get some state via ->state, some
via the duplicated ->state, and some via ->config. We'll want to
fix this all by tracking the commited state properly, but that will
some bigger refactroing so for now we'll just choose to accept these
potential mismatches.

I left out the state readout from the post-reset display
reinitialization as that still likes to clobber crtc->state etc.
If we make it use a free standing atomic state and mke sure it doesn't
need any locks we could reintroduce it. For now I just mark the
post-reset display state as dirty as possible to make sure we
reinitialize everything.

One other potential issue remains in the form of display detection.
Either we need to protect that with the same rw_semaphore as well, or
perhaps it would be enough to force gmbus into bitbanging mode while
the reset is happening and we don't have interrupts, and just across
the actual hardware GPU reset we could hold the gmbus mutex.

v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
v3: Drop all the committed_state refactoring to make this less
    obnoxious to backport (Daniel)
v4: Preserve the wedge timeout mechanism (Chris)

Cc: <stable@vger.kernel.org> # for the brave
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
 2 files changed, 138 insertions(+), 63 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index effbe4f72a64..88ddd27f61b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2237,6 +2237,8 @@ struct drm_i915_private {
 	struct drm_atomic_state *modeset_restore_state;
 	struct drm_modeset_acquire_ctx reset_ctx;
 
+	struct rw_semaphore commit_sem;
+
 	struct list_head vm_list; /* Global list of all address spaces */
 	struct i915_ggtt ggtt; /* VM representing the global address space */
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 7cdd6ec97f80..f13c7d81d4a9 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
 static void intel_modeset_setup_hw_state(struct drm_device *dev,
 					 struct drm_modeset_acquire_ctx *ctx);
 static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
 
 struct intel_limit {
 	struct {
@@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
 }
 
-void intel_prepare_reset(struct drm_i915_private *dev_priv)
+static void init_intel_state(struct intel_atomic_state *state)
+{
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
+	int i;
+
+	state->modeset = true;
+
+	for_each_oldnew_crtc_in_state(&state->base, crtc,
+				      old_crtc_state, new_crtc_state, i) {
+		if (new_crtc_state->active)
+			state->active_crtcs |= 1 << i;
+		else
+			state->active_crtcs &= ~(1 << i);
+
+		if (old_crtc_state->active != new_crtc_state->active)
+			state->active_pipe_changes |= drm_crtc_mask(crtc);
+	}
+}
+
+static struct drm_atomic_state *
+intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
 {
-	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
 	struct drm_atomic_state *state;
-	int ret;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	int i;
 
-	/*
-	 * Need mode_config.mutex so that we don't
-	 * trample ongoing ->detect() and whatnot.
-	 */
-	mutex_lock(&dev->mode_config.mutex);
-	drm_modeset_acquire_init(ctx, 0);
-	while (1) {
-		ret = drm_modeset_lock_all_ctx(dev, ctx);
-		if (ret != -EDEADLK)
-			break;
+	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
+	if (IS_ERR(state)) {
+		DRM_ERROR("Duplicating state failed with %ld\n",
+			  PTR_ERR(state));
+		return NULL;
+	}
+
+	to_intel_atomic_state(state)->active_crtcs = 0;
+	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
+	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
+
+	init_intel_state(to_intel_atomic_state(state));
+
+	/* force a full update */
+	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
+		struct intel_crtc_state *intel_crtc_state =
+			to_intel_crtc_state(crtc_state);
+
+		if (!crtc_state->active)
+			continue;
 
-		drm_modeset_backoff(ctx);
+		crtc_state->mode_changed = true;
+		crtc_state->active_changed = true;
+		crtc_state->planes_changed = true;
+		crtc_state->connectors_changed = true;
+		crtc_state->color_mgmt_changed = true;
+		crtc_state->zpos_changed = true;
+
+		intel_crtc_state->update_pipe = true;
+		intel_crtc_state->disable_lp_wm = true;
+		intel_crtc_state->disable_cxsr = true;
+		intel_crtc_state->update_wm_post = true;
+		intel_crtc_state->fb_changed = true;
+		intel_crtc_state->fifo_changed = true;
+		intel_crtc_state->wm.need_postvbl_update = true;
 	}
 
+	return state;
+}
+
+void intel_prepare_reset(struct drm_i915_private *dev_priv)
+{
+	struct drm_atomic_state *disable_state, *restore_state = NULL;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	struct drm_plane *plane;
+	struct drm_plane_state *plane_state;
+	int i;
+
+	down_write(&dev_priv->commit_sem);
+
 	/* reset doesn't touch the display, but flips might get nuked anyway, */
 	if (!i915.force_reset_modeset_test &&
 	    !gpu_reset_clobbers_display(dev_priv))
@@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	 * Disabling the crtcs gracefully seems nicer. Also the
 	 * g33 docs say we should at least disable all the planes.
 	 */
-	state = drm_atomic_helper_duplicate_state(dev, ctx);
-	if (IS_ERR(state)) {
-		ret = PTR_ERR(state);
-		DRM_ERROR("Duplicating state failed with %i\n", ret);
-		return;
-	}
+	disable_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(disable_state))
+		goto out;
 
-	ret = drm_atomic_helper_disable_all(dev, ctx);
-	if (ret) {
-		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
-		drm_atomic_state_put(state);
-		return;
-	}
+	to_intel_atomic_state(disable_state)->active_crtcs = 0;
 
-	dev_priv->modeset_restore_state = state;
-	state->acquire_ctx = ctx;
+	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+	__intel_atomic_commit_tail(disable_state, true);
+
+	drm_atomic_helper_clean_committed_state(disable_state);
+	drm_atomic_state_put(disable_state);
+
+	restore_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(restore_state))
+		restore_state = NULL;
+
+	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+out:
+	dev_priv->modeset_restore_state = restore_state;
 }
 
 void intel_finish_reset(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
-	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
-	int ret;
+	struct drm_atomic_state *restore_state =
+		dev_priv->modeset_restore_state;
 
 	/*
 	 * Flips in the rings will be nuked by the reset,
@@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 
 	/* reset doesn't touch the display */
 	if (!gpu_reset_clobbers_display(dev_priv)) {
-		if (!state) {
+		if (!restore_state) {
 			/*
 			 * Flips in the rings have been nuked by the reset,
 			 * so update the base address of all primary
@@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			 */
 			intel_update_primary_planes(dev);
 		} else {
-			ret = __intel_display_resume(dev, state, ctx);
-			if (ret)
-				DRM_ERROR("Restoring old state failed with %i\n", ret);
+			__intel_atomic_commit_tail(restore_state, true);
 		}
 	} else {
+		i915_redisable_vga(dev_priv);
+
 		/*
 		 * The display has been reset as well,
 		 * so need a full re-initialization.
@@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			dev_priv->display.hpd_irq_setup(dev_priv);
 		spin_unlock_irq(&dev_priv->irq_lock);
 
-		ret = __intel_display_resume(dev, state, ctx);
-		if (ret)
-			DRM_ERROR("Restoring old state failed with %i\n", ret);
+		__intel_atomic_commit_tail(restore_state, true);
 
 		intel_hpd_init(dev_priv);
 	}
 
-	if (state)
-		drm_atomic_state_put(state);
-	drm_modeset_drop_locks(ctx);
-	drm_modeset_acquire_fini(ctx);
-	mutex_unlock(&dev->mode_config.mutex);
+	if (restore_state) {
+		drm_atomic_helper_clean_committed_state(restore_state);
+		drm_atomic_state_put(restore_state);
+	}
+
+	up_write(&dev_priv->commit_sem);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
 {
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
 	struct drm_i915_private *dev_priv = to_i915(state->dev);
-	struct drm_crtc *crtc;
-	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-	int ret = 0, i;
+	int ret = 0;
 
 	if (!check_digital_port_conflicts(state)) {
 		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
 		return -EINVAL;
 	}
 
-	intel_state->modeset = true;
 	intel_state->active_crtcs = dev_priv->active_crtcs;
 	intel_state->cdclk.logical = dev_priv->cdclk.logical;
 	intel_state->cdclk.actual = dev_priv->cdclk.actual;
 
-	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
-		if (new_crtc_state->active)
-			intel_state->active_crtcs |= 1 << i;
-		else
-			intel_state->active_crtcs &= ~(1 << i);
-
-		if (old_crtc_state->active != new_crtc_state->active)
-			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
-	}
+	init_intel_state(intel_state);
 
 	/*
 	 * See if the config requires any additional preparation, e.g.
@@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	/* Only after disabling all output pipelines that will be changed can we
 	 * update the the output configuration. */
-	intel_modeset_update_crtc_state(state);
+	if (!is_reset)
+		intel_modeset_update_crtc_state(state);
 
 	if (intel_state->modeset) {
-		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		if (!is_reset) {
+			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		} else {
+			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
+				if (new_crtc_state->enable)
+					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
+			}
+		}
 
 		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
 
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (!intel_can_enable_sagv(state))
 			intel_disable_sagv(dev_priv);
 
-		intel_modeset_verify_disabled(dev, state);
+		if (!is_reset)
+			intel_modeset_verify_disabled(dev, state);
 	}
 
 	/* Complete the events for pipes that have now been disabled */
@@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (put_domains[i])
 			modeset_put_power_domains(dev_priv, put_domains[i]);
 
-		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
+		if (!is_reset)
+			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
 	}
 
 	if (intel_state->modeset && intel_can_enable_sagv(state))
@@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	drm_atomic_helper_wait_for_dependencies(state);
 
-	__intel_atomic_commit_tail(state);
+	down_read(&dev_priv->commit_sem);
+
+	__intel_atomic_commit_tail(state, false);
 
 	drm_atomic_helper_commit_hw_done(state);
 
+	up_read(&dev_priv->commit_sem);
+
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	mutex_unlock(&dev->struct_mutex);
@@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
 
+	init_rwsem(&dev_priv->commit_sem);
+
 	intel_init_quirks(dev);
 
 	intel_init_pm(dev_priv);
-- 
2.13.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again (rev2)
  2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
                   ` (5 preceding siblings ...)
  2017-06-29 14:24 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again Patchwork
@ 2017-06-29 14:55 ` Patchwork
  6 siblings, 0 replies; 44+ messages in thread
From: Patchwork @ 2017-06-29 14:55 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Fix pre-g4x GPU reset, again (rev2)
URL   : https://patchwork.freedesktop.org/series/26554/
State : success

== Summary ==

Series 26554v2 drm/i915: Fix pre-g4x GPU reset, again
https://patchwork.freedesktop.org/api/1.0/series/26554/revisions/2/mbox/

Test gem_exec_suspend:
        Subgroup basic-s4-devices:
                pass       -> DMESG-WARN (fi-kbl-7560u) fdo#100125
Test gem_ringfill:
        Subgroup basic-default-hang:
                dmesg-warn -> PASS       (fi-blb-e6850) fdo#101600 +1
Test kms_cursor_legacy:
        Subgroup basic-busy-flip-before-cursor-atomic:
                fail       -> PASS       (fi-snb-2600) fdo#100215
Test kms_pipe_crc_basic:
        Subgroup hang-read-crc-pipe-a:
                dmesg-warn -> PASS       (fi-pnv-d510) fdo#101597 +1

fdo#100125 https://bugs.freedesktop.org/show_bug.cgi?id=100125
fdo#101600 https://bugs.freedesktop.org/show_bug.cgi?id=101600
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#101597 https://bugs.freedesktop.org/show_bug.cgi?id=101597

fi-bdw-5557u     total:279  pass:264  dwarn:4   dfail:0   fail:0   skip:11  time:442s
fi-bdw-gvtdvm    total:279  pass:257  dwarn:8   dfail:0   fail:0   skip:14  time:435s
fi-blb-e6850     total:279  pass:225  dwarn:0   dfail:0   fail:0   skip:54  time:346s
fi-bsw-n3050     total:279  pass:243  dwarn:0   dfail:0   fail:0   skip:36  time:528s
fi-bxt-j4205     total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:513s
fi-byt-j1900     total:279  pass:254  dwarn:1   dfail:0   fail:0   skip:24  time:488s
fi-byt-n2820     total:279  pass:250  dwarn:1   dfail:0   fail:0   skip:28  time:483s
fi-glk-2a        total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:596s
fi-hsw-4770      total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:434s
fi-hsw-4770r     total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:414s
fi-ilk-650       total:279  pass:229  dwarn:0   dfail:0   fail:0   skip:50  time:420s
fi-ivb-3520m     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:505s
fi-ivb-3770      total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:474s
fi-kbl-7500u     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:471s
fi-kbl-7560u     total:279  pass:268  dwarn:1   dfail:0   fail:0   skip:10  time:571s
fi-kbl-r         total:279  pass:260  dwarn:1   dfail:0   fail:0   skip:18  time:582s
fi-pnv-d510      total:279  pass:223  dwarn:1   dfail:0   fail:0   skip:55  time:553s
fi-skl-6260u     total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:457s
fi-skl-6700hq    total:279  pass:223  dwarn:1   dfail:0   fail:30  skip:24  time:349s
fi-skl-6700k     total:279  pass:257  dwarn:4   dfail:0   fail:0   skip:18  time:466s
fi-skl-6770hq    total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:475s
fi-skl-gvtdvm    total:279  pass:266  dwarn:0   dfail:0   fail:0   skip:13  time:436s
fi-snb-2520m     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:543s
fi-snb-2600      total:279  pass:250  dwarn:0   dfail:0   fail:0   skip:29  time:402s

ee2da0ac24fb8d50a03b263eb1fa2e82849eda95 drm-tip: 2017y-06m-29d-13h-14m-49s UTC integration manifest
d25f19a drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
8b50bb0 drm/i915: Refactor __intel_atomic_commit_tail()
37b6e58 drm/i915% Store vma gtt offset in plane state
f385b63 drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
1b8f5ad drm/atomic: Refactor drm_atomic_state_realloc_connectors()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_5071/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again
  2017-06-29 14:24 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again Patchwork
@ 2017-06-29 15:07   ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-29 15:07 UTC (permalink / raw)
  To: intel-gfx

On Thu, Jun 29, 2017 at 02:24:21PM -0000, Patchwork wrote:
> == Series Details ==
> 
> Series: drm/i915: Fix pre-g4x GPU reset, again
> URL   : https://patchwork.freedesktop.org/series/26554/
> State : success
> 
> == Summary ==
> 
> Series 26554v1 drm/i915: Fix pre-g4x GPU reset, again
> https://patchwork.freedesktop.org/api/1.0/series/26554/revisions/1/mbox/
> 
> Test gem_exec_suspend:
>         Subgroup basic-s4-devices:
>                 pass       -> DMESG-WARN (fi-kbl-7560u) fdo#100125
> Test gem_ringfill:
>         Subgroup basic-default-hang:
>                 dmesg-warn -> PASS       (fi-blb-e6850) fdo#101600 +1
> Test kms_cursor_legacy:
>         Subgroup basic-busy-flip-before-cursor-atomic:
>                 fail       -> PASS       (fi-snb-2600) fdo#100215
> Test kms_pipe_crc_basic:
>         Subgroup hang-read-crc-pipe-b:
>                 pass       -> DMESG-WARN (fi-pnv-d510) fdo#101597

[  484.554418] drm/i915: Resetting chip after gpu hang
[  484.723084] pipe B vblank wait timed out
[  484.723168] ------------[ cut here ]------------
[  484.723303] WARNING: CPU: 3 PID: 3484 at drivers/gpu/drm/i915/intel_display.c:12906 __intel_atomic_commit_tail+0x101e/0x1050 [i915]

And it seems to have happened for pipe A as well. So far I can't 
reproduce this on the Lenovo Ideapan PNV I have on my desk.

>         Subgroup suspend-read-crc-pipe-b:
>                 dmesg-warn -> PASS       (fi-byt-n2820) fdo#101517
> 
> fdo#100125 https://bugs.freedesktop.org/show_bug.cgi?id=100125
> fdo#101600 https://bugs.freedesktop.org/show_bug.cgi?id=101600
> fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
> fdo#101597 https://bugs.freedesktop.org/show_bug.cgi?id=101597
> fdo#101517 https://bugs.freedesktop.org/show_bug.cgi?id=101517
> 
> fi-bdw-5557u     total:279  pass:264  dwarn:4   dfail:0   fail:0   skip:11  time:448s
> fi-bdw-gvtdvm    total:279  pass:257  dwarn:8   dfail:0   fail:0   skip:14  time:429s
> fi-blb-e6850     total:279  pass:225  dwarn:0   dfail:0   fail:0   skip:54  time:354s
> fi-bsw-n3050     total:279  pass:243  dwarn:0   dfail:0   fail:0   skip:36  time:530s
> fi-bxt-j4205     total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:512s
> fi-byt-j1900     total:279  pass:254  dwarn:1   dfail:0   fail:0   skip:24  time:485s
> fi-byt-n2820     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:495s
> fi-glk-2a        total:279  pass:260  dwarn:0   dfail:0   fail:0   skip:19  time:602s
> fi-hsw-4770      total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:433s
> fi-hsw-4770r     total:279  pass:259  dwarn:4   dfail:0   fail:0   skip:16  time:415s
> fi-ilk-650       total:279  pass:229  dwarn:0   dfail:0   fail:0   skip:50  time:422s
> fi-ivb-3520m     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:492s
> fi-ivb-3770      total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:476s
> fi-kbl-7500u     total:279  pass:261  dwarn:0   dfail:0   fail:0   skip:18  time:465s
> fi-kbl-7560u     total:279  pass:268  dwarn:1   dfail:0   fail:0   skip:10  time:570s
> fi-kbl-r         total:279  pass:260  dwarn:1   dfail:0   fail:0   skip:18  time:581s
> fi-pnv-d510      total:279  pass:222  dwarn:2   dfail:0   fail:0   skip:55  time:554s
> fi-skl-6260u     total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:461s
> fi-skl-6700hq    total:279  pass:223  dwarn:1   dfail:0   fail:30  skip:24  time:340s
> fi-skl-6700k     total:279  pass:257  dwarn:4   dfail:0   fail:0   skip:18  time:463s
> fi-skl-6770hq    total:279  pass:269  dwarn:0   dfail:0   fail:0   skip:10  time:476s
> fi-skl-gvtdvm    total:279  pass:266  dwarn:0   dfail:0   fail:0   skip:13  time:437s
> fi-snb-2520m     total:279  pass:251  dwarn:0   dfail:0   fail:0   skip:28  time:541s
> fi-snb-2600      total:279  pass:250  dwarn:0   dfail:0   fail:0   skip:29  time:405s
> 
> ee2da0ac24fb8d50a03b263eb1fa2e82849eda95 drm-tip: 2017y-06m-29d-13h-14m-49s UTC integration manifest
> d37cc43 drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
> e6e3d92 drm/i915: Refactor __intel_atomic_commit_tail()
> c33eea0 drm/i915% Store vma gtt offset in plane state
> 5ab92c4 drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
> 988bc43 drm/atomic: Refactor drm_atomic_state_realloc_connectors()
> 
> == Logs ==
> 
> For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_5069/

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 14:36     ` ville.syrjala
  (?)
@ 2017-06-29 17:57     ` Chris Wilson
  2017-06-29 19:26         ` Ville Syrjälä
  -1 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2017-06-29 17:57 UTC (permalink / raw)
  To: ville.syrjala, intel-gfx; +Cc: dri-devel, stable, Daniel Vetter

Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Introduce an rw_semaphore to protect the display commits. All normal
> commits use down_read() and hence can proceed in parallel, but GPU reset
> will use down_write() making sure no other commits are in progress when
> we have to pull the plug on the display engine on pre-g4x platforms.
> There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> itself, and we wait for all dependencies before the down_read(), and
> thus there is no chance of deadlocks with this scheme.

This matches what I thought should be done (I didn't think of using
rwsem just a mutex, nice touch). The point I got stuck on was what
should be done after the reset? I expected another modeset to return the
state back or otherwise the inflight would get confused?
 
> During reset we should be recommiting the state that was committed last.
> But for now we'll settle for recommiting the last state for each object.

Ah, I guess that explains the above. What is the complication with
restoring the current state as opposed to the next state?

But I have to leave debating the merits of atomic internals to others.  :|
-Chris

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 17:57     ` Chris Wilson
@ 2017-06-29 19:26         ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-29 19:26 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter

On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> This matches what I thought should be done (I didn't think of using
> rwsem just a mutex, nice touch). The point I got stuck on was what
> should be done after the reset? I expected another modeset to return the
> state back or otherwise the inflight would get confused?

I guess that can happen. For instance, if we have a crtc_enable() in flight,
and then we do a reset before it gets committed we would end up doing
crtc_enable() twice in a row without a crtc_disable in between. For page
flips and such this shouldn't be a big deal in general.

>  
> > During reset we should be recommiting the state that was committed last.
> > But for now we'll settle for recommiting the last state for each object.
> 
> Ah, I guess that explains the above. What is the complication with
> restoring the current state as opposed to the next state?

Well the main thing is just tracking which is the current state. That
just needs refactoring the .atomic_duplicate_state() calling convention
across the whole tree so that we can then duplicate the committed state
rather than the latest state.

Also due to the commit_hw_done() being potentially done after the
modeset locks have been dropped, I don't think we can be certain
of it getting called in the same order as swap_state(), hence
when we track the committed state in commit_hw_done() we'll have
to have some way to figure out if our new state is in fact the
latest committed state for each object or if the calls got
reordered. We don't insert any dependencies between two commits
unless they touch the same active crtc, thus this reordering
seems very much possible. Dunno if we should add some way to add
such dependeencies whenever the same object is part of two otherwise
independent commits, or if we just want to try and work with the
reordered calls. My idea for the latter was some kind of seqno/age
counter on the object states that allows me to recongnize which
state is more recent. The object states aren't refcounted so hanging
on to the wrong pointer could cause an oops the next time we have to
perform a GPU reset.

-- 
Ville Syrj�l�
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-29 19:26         ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-29 19:26 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, intel-gfx, stable, dri-devel

On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> This matches what I thought should be done (I didn't think of using
> rwsem just a mutex, nice touch). The point I got stuck on was what
> should be done after the reset? I expected another modeset to return the
> state back or otherwise the inflight would get confused?

I guess that can happen. For instance, if we have a crtc_enable() in flight,
and then we do a reset before it gets committed we would end up doing
crtc_enable() twice in a row without a crtc_disable in between. For page
flips and such this shouldn't be a big deal in general.

>  
> > During reset we should be recommiting the state that was committed last.
> > But for now we'll settle for recommiting the last state for each object.
> 
> Ah, I guess that explains the above. What is the complication with
> restoring the current state as opposed to the next state?

Well the main thing is just tracking which is the current state. That
just needs refactoring the .atomic_duplicate_state() calling convention
across the whole tree so that we can then duplicate the committed state
rather than the latest state.

Also due to the commit_hw_done() being potentially done after the
modeset locks have been dropped, I don't think we can be certain
of it getting called in the same order as swap_state(), hence
when we track the committed state in commit_hw_done() we'll have
to have some way to figure out if our new state is in fact the
latest committed state for each object or if the calls got
reordered. We don't insert any dependencies between two commits
unless they touch the same active crtc, thus this reordering
seems very much possible. Dunno if we should add some way to add
such dependeencies whenever the same object is part of two otherwise
independent commits, or if we just want to try and work with the
reordered calls. My idea for the latter was some kind of seqno/age
counter on the object states that allows me to recongnize which
state is more recent. The object states aren't refcounted so hanging
on to the wrong pointer could cause an oops the next time we have to
perform a GPU reset.

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors()
  2017-06-29 13:49 ` [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors() ville.syrjala
@ 2017-06-30 13:15   ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:15 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel

On Thu, Jun 29, 2017 at 04:49:44PM +0300, ville.syrjala@linux.intel.com wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Pull the code to reallocate the state->connectors[] array into a
> helper function. We'll have another use for this later.
> 
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/drm_atomic.c | 43 +++++++++++++++++++++++++++++--------------
>  include/drm/drm_atomic.h     |  5 +++++
>  2 files changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c
> index 095e87278a88..a9f02b214fc6 100644
> --- a/drivers/gpu/drm/drm_atomic.c
> +++ b/drivers/gpu/drm/drm_atomic.c
> @@ -1043,6 +1043,32 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj,
>  }
>  EXPORT_SYMBOL(drm_atomic_get_private_obj_state);
>  
> +int drm_atomic_state_realloc_connectors(struct drm_device *dev,
> +					struct drm_atomic_state *state,
> +					int index)

Needs some pretty kerneldoc, best with some explanation why/when drivers
might want to use this (i.e. when they're constructing their own state for
special reasons like hw state read-out or recovery after a hw reset or
whatever). Commit message should explain that too, but better to stuff it
into the kerneldoc. With that addressed:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> +{
> +	struct drm_mode_config *config = &dev->mode_config;
> +	struct __drm_connnectors_state *c;
> +	int alloc = max(index + 1, config->num_connector);
> +
> +	if (index < state->num_connector)
> +		return 0;
> +
> +	c = krealloc(state->connectors,
> +		     alloc * sizeof(*state->connectors), GFP_KERNEL);
> +	if (!c)
> +		return -ENOMEM;
> +
> +	state->connectors = c;
> +	memset(&state->connectors[state->num_connector], 0,
> +	       sizeof(*state->connectors) * (alloc - state->num_connector));
> +
> +	state->num_connector = alloc;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(drm_atomic_state_realloc_connectors);
> +
>  /**
>   * drm_atomic_get_connector_state - get connector state
>   * @state: global atomic state object
> @@ -1074,20 +1100,9 @@ drm_atomic_get_connector_state(struct drm_atomic_state *state,
>  
>  	index = drm_connector_index(connector);
>  
> -	if (index >= state->num_connector) {
> -		struct __drm_connnectors_state *c;
> -		int alloc = max(index + 1, config->num_connector);
> -
> -		c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL);
> -		if (!c)
> -			return ERR_PTR(-ENOMEM);
> -
> -		state->connectors = c;
> -		memset(&state->connectors[state->num_connector], 0,
> -		       sizeof(*state->connectors) * (alloc - state->num_connector));
> -
> -		state->num_connector = alloc;
> -	}
> +	ret = drm_atomic_state_realloc_connectors(connector->dev, state, index);
> +	if (ret)
> +		return ERR_PTR(ret);
>  
>  	if (state->connectors[index].state)
>  		return state->connectors[index].state;
> diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h
> index 0196f264a418..5596ad58bcdc 100644
> --- a/include/drm/drm_atomic.h
> +++ b/include/drm/drm_atomic.h
> @@ -324,6 +324,11 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state,
>  			      void *obj,
>  			      const struct drm_private_state_funcs *funcs);
>  
> +int __must_check
> +drm_atomic_state_realloc_connectors(struct drm_device *dev,
> +				    struct drm_atomic_state *state,
> +				    int index);
> +
>  /**
>   * drm_atomic_get_existing_crtc_state - get crtc state, if it exists
>   * @state: global atomic state object
> -- 
> 2.13.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-30 13:25     ` Daniel Vetter
  2017-06-29 14:36     ` ville.syrjala
  2017-06-30 13:25     ` Daniel Vetter
  2 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:25 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> 
> Introduce an rw_semaphore to protect the display commits. All normal
> commits use down_read() and hence can proceed in parallel, but GPU reset
> will use down_write() making sure no other commits are in progress when
> we have to pull the plug on the display engine on pre-g4x platforms.
> There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> itself, and we wait for all dependencies before the down_read(), and
> thus there is no chance of deadlocks with this scheme.

How does this solve the deadlock? Afaiui the deadlock is that the gpu
reset stopped unconditionally completing all requests before it did
anything else, including anything with the hw or taking modeset locks.

This ensured that any outstanding flips (we only had pageflips, no atomic
ones back then) could complete (although maybe displaying garbage). The
only thing we had to do was grab the locks (to avoid concurrent modesets)
and then we could happily nuke the hw (since the flips where lost in the
CS anyway), and restore it afterwards.

Then the TDR rewrite came around and broke that, which now makes atomic
time out waiting for the gpu to complete (because the gpu reset waits for
the modeset to complete first). Which means if you want to fix this
without breaking TDR, then you need to cancel the pending atomic commits.
That seems somewhat what you're attempting here with trying to figure out
what the latest hw-committed step is (but that function gets it wrong),
and lots more locking and stuff on top.

Why exactly can't we just go back to the old world of force-completing
dead requests on gen4 and earlier? That would be tons simpler imo instead
of throwing a pile of hacks (which really needs a complete rewrite of the
atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
earlier for years, going back to that isn't going to hurt anyone.

Making working gen4 gpu reset contigent on cancellable atomic commits and
the full commit queue is imo nuts.
-Daniel

> 
> During reset we should be recommiting the state that was committed last.
> But for now we'll settle for recommiting the last state for each object.
> Hence we may commit things a bit too soon when a GPU reset occurs. The
> rw_semaphore should guarantee that whatever state we observe in
> obj->state during reset sticks around while we do the commit. The
> obj->state pointer might change for some objects if another swap_state()
> occurs while we do our thing, so there migth be some theoretical
> mismatch which I tink we could avoid by grabbing the rw_semaphore also
> around the swap_state(), but for now I didn't do that.
> 
> Another source of mismatch can happen because we sometimes use the
> intel_crtc->config during the actual commit, and that only gets updated
> when we do the commuit. Hence we may get some state via ->state, some
> via the duplicated ->state, and some via ->config. We'll want to
> fix this all by tracking the commited state properly, but that will
> some bigger refactroing so for now we'll just choose to accept these
> potential mismatches.
> 
> I left out the state readout from the post-reset display
> reinitialization as that still likes to clobber crtc->state etc.
> If we make it use a free standing atomic state and mke sure it doesn't
> need any locks we could reintroduce it. For now I just mark the
> post-reset display state as dirty as possible to make sure we
> reinitialize everything.
> 
> One other potential issue remains in the form of display detection.
> Either we need to protect that with the same rw_semaphore as well, or
> perhaps it would be enough to force gmbus into bitbanging mode while
> the reset is happening and we don't have interrupts, and just across
> the actual hardware GPU reset we could hold the gmbus mutex.
> 
> v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
> v3: Drop all the committed_state refactoring to make this less
>     obnoxious to backport (Daniel)
> 
> Cc: stable@vger.kernel.org # for the brave
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
> Signed-off-by: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |   2 +
>  drivers/gpu/drm/i915/i915_irq.c      |  44 +-------
>  drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
>  3 files changed, 139 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index effbe4f72a64..88ddd27f61b0 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2237,6 +2237,8 @@ struct drm_i915_private {
>  	struct drm_atomic_state *modeset_restore_state;
>  	struct drm_modeset_acquire_ctx reset_ctx;
>  
> +	struct rw_semaphore commit_sem;
> +
>  	struct list_head vm_list; /* Global list of all address spaces */
>  	struct i915_ggtt ggtt; /* VM representing the global address space */
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index eb4f1dca2077..9d591f17fda3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>  	return ret;
>  }
>  
> -struct wedge_me {
> -	struct delayed_work work;
> -	struct drm_i915_private *i915;
> -	const char *name;
> -};
> -
> -static void wedge_me(struct work_struct *work)
> -{
> -	struct wedge_me *w = container_of(work, typeof(*w), work.work);
> -
> -	dev_err(w->i915->drm.dev,
> -		"%s timed out, cancelling all in-flight rendering.\n",
> -		w->name);
> -	i915_gem_set_wedged(w->i915);
> -}
> -
> -static void __init_wedge(struct wedge_me *w,
> -			 struct drm_i915_private *i915,
> -			 long timeout,
> -			 const char *name)
> -{
> -	w->i915 = i915;
> -	w->name = name;
> -
> -	INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
> -	schedule_delayed_work(&w->work, timeout);
> -}
> -
> -static void __fini_wedge(struct wedge_me *w)
> -{
> -	cancel_delayed_work_sync(&w->work);
> -	destroy_delayed_work_on_stack(&w->work);
> -	w->i915 = NULL;
> -}
> -
> -#define i915_wedge_on_timeout(W, DEV, TIMEOUT)				\
> -	for (__init_wedge((W), (DEV), (TIMEOUT), __func__);		\
> -	     (W)->i915;							\
> -	     __fini_wedge((W)))
> -
>  /**
>   * i915_reset_device - do process context error handling work
>   * @dev_priv: i915 device private
> @@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -	struct wedge_me w;
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
>  	DRM_DEBUG_DRIVER("resetting chip\n");
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> -	/* Use a watchdog to ensure that our reset completes */
> -	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
> +	{
>  		intel_prepare_reset(dev_priv);
>  
>  		/* Signal that locked waiters should reset the GPU */
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 7cdd6ec97f80..f13c7d81d4a9 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
>  static void intel_modeset_setup_hw_state(struct drm_device *dev,
>  					 struct drm_modeset_acquire_ctx *ctx);
>  static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
>  
>  struct intel_limit {
>  	struct {
> @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
>  		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
>  }
>  
> -void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +static void init_intel_state(struct intel_atomic_state *state)
> +{
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> +	int i;
> +
> +	state->modeset = true;
> +
> +	for_each_oldnew_crtc_in_state(&state->base, crtc,
> +				      old_crtc_state, new_crtc_state, i) {
> +		if (new_crtc_state->active)
> +			state->active_crtcs |= 1 << i;
> +		else
> +			state->active_crtcs &= ~(1 << i);
> +
> +		if (old_crtc_state->active != new_crtc_state->active)
> +			state->active_pipe_changes |= drm_crtc_mask(crtc);
> +	}
> +}
> +
> +static struct drm_atomic_state *
> +intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
>  {
> -	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
>  	struct drm_atomic_state *state;
> -	int ret;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	int i;
>  
> -	/*
> -	 * Need mode_config.mutex so that we don't
> -	 * trample ongoing ->detect() and whatnot.
> -	 */
> -	mutex_lock(&dev->mode_config.mutex);
> -	drm_modeset_acquire_init(ctx, 0);
> -	while (1) {
> -		ret = drm_modeset_lock_all_ctx(dev, ctx);
> -		if (ret != -EDEADLK)
> -			break;
> +	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
> +	if (IS_ERR(state)) {
> +		DRM_ERROR("Duplicating state failed with %ld\n",
> +			  PTR_ERR(state));
> +		return NULL;
> +	}
> +
> +	to_intel_atomic_state(state)->active_crtcs = 0;
> +	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
> +	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
> +
> +	init_intel_state(to_intel_atomic_state(state));
> +
> +	/* force a full update */
> +	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
> +		struct intel_crtc_state *intel_crtc_state =
> +			to_intel_crtc_state(crtc_state);
> +
> +		if (!crtc_state->active)
> +			continue;
>  
> -		drm_modeset_backoff(ctx);
> +		crtc_state->mode_changed = true;
> +		crtc_state->active_changed = true;
> +		crtc_state->planes_changed = true;
> +		crtc_state->connectors_changed = true;
> +		crtc_state->color_mgmt_changed = true;
> +		crtc_state->zpos_changed = true;
> +
> +		intel_crtc_state->update_pipe = true;
> +		intel_crtc_state->disable_lp_wm = true;
> +		intel_crtc_state->disable_cxsr = true;
> +		intel_crtc_state->update_wm_post = true;
> +		intel_crtc_state->fb_changed = true;
> +		intel_crtc_state->fifo_changed = true;
> +		intel_crtc_state->wm.need_postvbl_update = true;
>  	}
>  
> +	return state;
> +}
> +
> +void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_atomic_state *disable_state, *restore_state = NULL;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	struct drm_plane *plane;
> +	struct drm_plane_state *plane_state;
> +	int i;
> +
> +	down_write(&dev_priv->commit_sem);
> +
>  	/* reset doesn't touch the display, but flips might get nuked anyway, */
>  	if (!i915.force_reset_modeset_test &&
>  	    !gpu_reset_clobbers_display(dev_priv))
> @@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>  	 * Disabling the crtcs gracefully seems nicer. Also the
>  	 * g33 docs say we should at least disable all the planes.
>  	 */
> -	state = drm_atomic_helper_duplicate_state(dev, ctx);
> -	if (IS_ERR(state)) {
> -		ret = PTR_ERR(state);
> -		DRM_ERROR("Duplicating state failed with %i\n", ret);
> -		return;
> -	}
> +	disable_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(disable_state))
> +		goto out;
>  
> -	ret = drm_atomic_helper_disable_all(dev, ctx);
> -	if (ret) {
> -		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
> -		drm_atomic_state_put(state);
> -		return;
> -	}
> +	to_intel_atomic_state(disable_state)->active_crtcs = 0;
>  
> -	dev_priv->modeset_restore_state = state;
> -	state->acquire_ctx = ctx;
> +	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +	__intel_atomic_commit_tail(disable_state, true);
> +
> +	drm_atomic_helper_clean_committed_state(disable_state);
> +	drm_atomic_state_put(disable_state);
> +
> +	restore_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(restore_state))
> +		restore_state = NULL;
> +
> +	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +out:
> +	dev_priv->modeset_restore_state = restore_state;
>  }
>  
>  void intel_finish_reset(struct drm_i915_private *dev_priv)
>  {
>  	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
> -	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
> -	int ret;
> +	struct drm_atomic_state *restore_state =
> +		dev_priv->modeset_restore_state;
>  
>  	/*
>  	 * Flips in the rings will be nuked by the reset,
> @@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  
>  	/* reset doesn't touch the display */
>  	if (!gpu_reset_clobbers_display(dev_priv)) {
> -		if (!state) {
> +		if (!restore_state) {
>  			/*
>  			 * Flips in the rings have been nuked by the reset,
>  			 * so update the base address of all primary
> @@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			 */
>  			intel_update_primary_planes(dev);
>  		} else {
> -			ret = __intel_display_resume(dev, state, ctx);
> -			if (ret)
> -				DRM_ERROR("Restoring old state failed with %i\n", ret);
> +			__intel_atomic_commit_tail(restore_state, true);
>  		}
>  	} else {
> +		i915_redisable_vga(dev_priv);
> +
>  		/*
>  		 * The display has been reset as well,
>  		 * so need a full re-initialization.
> @@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			dev_priv->display.hpd_irq_setup(dev_priv);
>  		spin_unlock_irq(&dev_priv->irq_lock);
>  
> -		ret = __intel_display_resume(dev, state, ctx);
> -		if (ret)
> -			DRM_ERROR("Restoring old state failed with %i\n", ret);
> +		__intel_atomic_commit_tail(restore_state, true);
>  
>  		intel_hpd_init(dev_priv);
>  	}
>  
> -	if (state)
> -		drm_atomic_state_put(state);
> -	drm_modeset_drop_locks(ctx);
> -	drm_modeset_acquire_fini(ctx);
> -	mutex_unlock(&dev->mode_config.mutex);
> +	if (restore_state) {
> +		drm_atomic_helper_clean_committed_state(restore_state);
> +		drm_atomic_state_put(restore_state);
> +	}
> +
> +	up_write(&dev_priv->commit_sem);
>  }
>  
>  static bool abort_flip_on_reset(struct intel_crtc *crtc)
> @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
>  {
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
>  	struct drm_i915_private *dev_priv = to_i915(state->dev);
> -	struct drm_crtc *crtc;
> -	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> -	int ret = 0, i;
> +	int ret = 0;
>  
>  	if (!check_digital_port_conflicts(state)) {
>  		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
>  		return -EINVAL;
>  	}
>  
> -	intel_state->modeset = true;
>  	intel_state->active_crtcs = dev_priv->active_crtcs;
>  	intel_state->cdclk.logical = dev_priv->cdclk.logical;
>  	intel_state->cdclk.actual = dev_priv->cdclk.actual;
>  
> -	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
> -		if (new_crtc_state->active)
> -			intel_state->active_crtcs |= 1 << i;
> -		else
> -			intel_state->active_crtcs &= ~(1 << i);
> -
> -		if (old_crtc_state->active != new_crtc_state->active)
> -			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
> -	}
> +	init_intel_state(intel_state);
>  
>  	/*
>  	 * See if the config requires any additional preparation, e.g.
> @@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
>  	intel_atomic_helper_free_state(dev_priv);
>  }
>  
> -static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
>  {
>  	struct drm_device *dev = state->dev;
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
> @@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	/* Only after disabling all output pipelines that will be changed can we
>  	 * update the the output configuration. */
> -	intel_modeset_update_crtc_state(state);
> +	if (!is_reset)
> +		intel_modeset_update_crtc_state(state);
>  
>  	if (intel_state->modeset) {
> -		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		if (!is_reset) {
> +			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		} else {
> +			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
> +				if (new_crtc_state->enable)
> +					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
> +			}
> +		}
>  
>  		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
>  
> @@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (!intel_can_enable_sagv(state))
>  			intel_disable_sagv(dev_priv);
>  
> -		intel_modeset_verify_disabled(dev, state);
> +		if (!is_reset)
> +			intel_modeset_verify_disabled(dev, state);
>  	}
>  
>  	/* Complete the events for pipes that have now been disabled */
> @@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (put_domains[i])
>  			modeset_put_power_domains(dev_priv, put_domains[i]);
>  
> -		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
> +		if (!is_reset)
> +			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
>  	}
>  
>  	if (intel_state->modeset && intel_can_enable_sagv(state))
> @@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	drm_atomic_helper_wait_for_dependencies(state);
>  
> -	__intel_atomic_commit_tail(state);
> +	down_read(&dev_priv->commit_sem);
> +
> +	__intel_atomic_commit_tail(state, false);
>  
>  	drm_atomic_helper_commit_hw_done(state);
>  
> +	up_read(&dev_priv->commit_sem);
> +
>  	mutex_lock(&dev->struct_mutex);
>  	drm_atomic_helper_cleanup_planes(dev, state);
>  	mutex_unlock(&dev->struct_mutex);
> @@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
>  	INIT_WORK(&dev_priv->atomic_helper.free_work,
>  		  intel_atomic_helper_free_state_worker);
>  
> +	init_rwsem(&dev_priv->commit_sem);
> +
>  	intel_init_quirks(dev);
>  
>  	intel_init_pm(dev_priv);
> -- 
> 2.13.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 13:25     ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:25 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Introduce an rw_semaphore to protect the display commits. All normal
> commits use down_read() and hence can proceed in parallel, but GPU reset
> will use down_write() making sure no other commits are in progress when
> we have to pull the plug on the display engine on pre-g4x platforms.
> There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> itself, and we wait for all dependencies before the down_read(), and
> thus there is no chance of deadlocks with this scheme.

How does this solve the deadlock? Afaiui the deadlock is that the gpu
reset stopped unconditionally completing all requests before it did
anything else, including anything with the hw or taking modeset locks.

This ensured that any outstanding flips (we only had pageflips, no atomic
ones back then) could complete (although maybe displaying garbage). The
only thing we had to do was grab the locks (to avoid concurrent modesets)
and then we could happily nuke the hw (since the flips where lost in the
CS anyway), and restore it afterwards.

Then the TDR rewrite came around and broke that, which now makes atomic
time out waiting for the gpu to complete (because the gpu reset waits for
the modeset to complete first). Which means if you want to fix this
without breaking TDR, then you need to cancel the pending atomic commits.
That seems somewhat what you're attempting here with trying to figure out
what the latest hw-committed step is (but that function gets it wrong),
and lots more locking and stuff on top.

Why exactly can't we just go back to the old world of force-completing
dead requests on gen4 and earlier? That would be tons simpler imo instead
of throwing a pile of hacks (which really needs a complete rewrite of the
atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
earlier for years, going back to that isn't going to hurt anyone.

Making working gen4 gpu reset contigent on cancellable atomic commits and
the full commit queue is imo nuts.
-Daniel

> 
> During reset we should be recommiting the state that was committed last.
> But for now we'll settle for recommiting the last state for each object.
> Hence we may commit things a bit too soon when a GPU reset occurs. The
> rw_semaphore should guarantee that whatever state we observe in
> obj->state during reset sticks around while we do the commit. The
> obj->state pointer might change for some objects if another swap_state()
> occurs while we do our thing, so there migth be some theoretical
> mismatch which I tink we could avoid by grabbing the rw_semaphore also
> around the swap_state(), but for now I didn't do that.
> 
> Another source of mismatch can happen because we sometimes use the
> intel_crtc->config during the actual commit, and that only gets updated
> when we do the commuit. Hence we may get some state via ->state, some
> via the duplicated ->state, and some via ->config. We'll want to
> fix this all by tracking the commited state properly, but that will
> some bigger refactroing so for now we'll just choose to accept these
> potential mismatches.
> 
> I left out the state readout from the post-reset display
> reinitialization as that still likes to clobber crtc->state etc.
> If we make it use a free standing atomic state and mke sure it doesn't
> need any locks we could reintroduce it. For now I just mark the
> post-reset display state as dirty as possible to make sure we
> reinitialize everything.
> 
> One other potential issue remains in the form of display detection.
> Either we need to protect that with the same rw_semaphore as well, or
> perhaps it would be enough to force gmbus into bitbanging mode while
> the reset is happening and we don't have interrupts, and just across
> the actual hardware GPU reset we could hold the gmbus mutex.
> 
> v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
> v3: Drop all the committed_state refactoring to make this less
>     obnoxious to backport (Daniel)
> 
> Cc: stable@vger.kernel.org # for the brave
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |   2 +
>  drivers/gpu/drm/i915/i915_irq.c      |  44 +-------
>  drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
>  3 files changed, 139 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index effbe4f72a64..88ddd27f61b0 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2237,6 +2237,8 @@ struct drm_i915_private {
>  	struct drm_atomic_state *modeset_restore_state;
>  	struct drm_modeset_acquire_ctx reset_ctx;
>  
> +	struct rw_semaphore commit_sem;
> +
>  	struct list_head vm_list; /* Global list of all address spaces */
>  	struct i915_ggtt ggtt; /* VM representing the global address space */
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index eb4f1dca2077..9d591f17fda3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>  	return ret;
>  }
>  
> -struct wedge_me {
> -	struct delayed_work work;
> -	struct drm_i915_private *i915;
> -	const char *name;
> -};
> -
> -static void wedge_me(struct work_struct *work)
> -{
> -	struct wedge_me *w = container_of(work, typeof(*w), work.work);
> -
> -	dev_err(w->i915->drm.dev,
> -		"%s timed out, cancelling all in-flight rendering.\n",
> -		w->name);
> -	i915_gem_set_wedged(w->i915);
> -}
> -
> -static void __init_wedge(struct wedge_me *w,
> -			 struct drm_i915_private *i915,
> -			 long timeout,
> -			 const char *name)
> -{
> -	w->i915 = i915;
> -	w->name = name;
> -
> -	INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
> -	schedule_delayed_work(&w->work, timeout);
> -}
> -
> -static void __fini_wedge(struct wedge_me *w)
> -{
> -	cancel_delayed_work_sync(&w->work);
> -	destroy_delayed_work_on_stack(&w->work);
> -	w->i915 = NULL;
> -}
> -
> -#define i915_wedge_on_timeout(W, DEV, TIMEOUT)				\
> -	for (__init_wedge((W), (DEV), (TIMEOUT), __func__);		\
> -	     (W)->i915;							\
> -	     __fini_wedge((W)))
> -
>  /**
>   * i915_reset_device - do process context error handling work
>   * @dev_priv: i915 device private
> @@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -	struct wedge_me w;
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
>  	DRM_DEBUG_DRIVER("resetting chip\n");
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> -	/* Use a watchdog to ensure that our reset completes */
> -	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
> +	{
>  		intel_prepare_reset(dev_priv);
>  
>  		/* Signal that locked waiters should reset the GPU */
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 7cdd6ec97f80..f13c7d81d4a9 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
>  static void intel_modeset_setup_hw_state(struct drm_device *dev,
>  					 struct drm_modeset_acquire_ctx *ctx);
>  static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
>  
>  struct intel_limit {
>  	struct {
> @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
>  		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
>  }
>  
> -void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +static void init_intel_state(struct intel_atomic_state *state)
> +{
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> +	int i;
> +
> +	state->modeset = true;
> +
> +	for_each_oldnew_crtc_in_state(&state->base, crtc,
> +				      old_crtc_state, new_crtc_state, i) {
> +		if (new_crtc_state->active)
> +			state->active_crtcs |= 1 << i;
> +		else
> +			state->active_crtcs &= ~(1 << i);
> +
> +		if (old_crtc_state->active != new_crtc_state->active)
> +			state->active_pipe_changes |= drm_crtc_mask(crtc);
> +	}
> +}
> +
> +static struct drm_atomic_state *
> +intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
>  {
> -	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
>  	struct drm_atomic_state *state;
> -	int ret;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	int i;
>  
> -	/*
> -	 * Need mode_config.mutex so that we don't
> -	 * trample ongoing ->detect() and whatnot.
> -	 */
> -	mutex_lock(&dev->mode_config.mutex);
> -	drm_modeset_acquire_init(ctx, 0);
> -	while (1) {
> -		ret = drm_modeset_lock_all_ctx(dev, ctx);
> -		if (ret != -EDEADLK)
> -			break;
> +	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
> +	if (IS_ERR(state)) {
> +		DRM_ERROR("Duplicating state failed with %ld\n",
> +			  PTR_ERR(state));
> +		return NULL;
> +	}
> +
> +	to_intel_atomic_state(state)->active_crtcs = 0;
> +	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
> +	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
> +
> +	init_intel_state(to_intel_atomic_state(state));
> +
> +	/* force a full update */
> +	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
> +		struct intel_crtc_state *intel_crtc_state =
> +			to_intel_crtc_state(crtc_state);
> +
> +		if (!crtc_state->active)
> +			continue;
>  
> -		drm_modeset_backoff(ctx);
> +		crtc_state->mode_changed = true;
> +		crtc_state->active_changed = true;
> +		crtc_state->planes_changed = true;
> +		crtc_state->connectors_changed = true;
> +		crtc_state->color_mgmt_changed = true;
> +		crtc_state->zpos_changed = true;
> +
> +		intel_crtc_state->update_pipe = true;
> +		intel_crtc_state->disable_lp_wm = true;
> +		intel_crtc_state->disable_cxsr = true;
> +		intel_crtc_state->update_wm_post = true;
> +		intel_crtc_state->fb_changed = true;
> +		intel_crtc_state->fifo_changed = true;
> +		intel_crtc_state->wm.need_postvbl_update = true;
>  	}
>  
> +	return state;
> +}
> +
> +void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_atomic_state *disable_state, *restore_state = NULL;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	struct drm_plane *plane;
> +	struct drm_plane_state *plane_state;
> +	int i;
> +
> +	down_write(&dev_priv->commit_sem);
> +
>  	/* reset doesn't touch the display, but flips might get nuked anyway, */
>  	if (!i915.force_reset_modeset_test &&
>  	    !gpu_reset_clobbers_display(dev_priv))
> @@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>  	 * Disabling the crtcs gracefully seems nicer. Also the
>  	 * g33 docs say we should at least disable all the planes.
>  	 */
> -	state = drm_atomic_helper_duplicate_state(dev, ctx);
> -	if (IS_ERR(state)) {
> -		ret = PTR_ERR(state);
> -		DRM_ERROR("Duplicating state failed with %i\n", ret);
> -		return;
> -	}
> +	disable_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(disable_state))
> +		goto out;
>  
> -	ret = drm_atomic_helper_disable_all(dev, ctx);
> -	if (ret) {
> -		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
> -		drm_atomic_state_put(state);
> -		return;
> -	}
> +	to_intel_atomic_state(disable_state)->active_crtcs = 0;
>  
> -	dev_priv->modeset_restore_state = state;
> -	state->acquire_ctx = ctx;
> +	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +	__intel_atomic_commit_tail(disable_state, true);
> +
> +	drm_atomic_helper_clean_committed_state(disable_state);
> +	drm_atomic_state_put(disable_state);
> +
> +	restore_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(restore_state))
> +		restore_state = NULL;
> +
> +	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +out:
> +	dev_priv->modeset_restore_state = restore_state;
>  }
>  
>  void intel_finish_reset(struct drm_i915_private *dev_priv)
>  {
>  	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
> -	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
> -	int ret;
> +	struct drm_atomic_state *restore_state =
> +		dev_priv->modeset_restore_state;
>  
>  	/*
>  	 * Flips in the rings will be nuked by the reset,
> @@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  
>  	/* reset doesn't touch the display */
>  	if (!gpu_reset_clobbers_display(dev_priv)) {
> -		if (!state) {
> +		if (!restore_state) {
>  			/*
>  			 * Flips in the rings have been nuked by the reset,
>  			 * so update the base address of all primary
> @@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			 */
>  			intel_update_primary_planes(dev);
>  		} else {
> -			ret = __intel_display_resume(dev, state, ctx);
> -			if (ret)
> -				DRM_ERROR("Restoring old state failed with %i\n", ret);
> +			__intel_atomic_commit_tail(restore_state, true);
>  		}
>  	} else {
> +		i915_redisable_vga(dev_priv);
> +
>  		/*
>  		 * The display has been reset as well,
>  		 * so need a full re-initialization.
> @@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			dev_priv->display.hpd_irq_setup(dev_priv);
>  		spin_unlock_irq(&dev_priv->irq_lock);
>  
> -		ret = __intel_display_resume(dev, state, ctx);
> -		if (ret)
> -			DRM_ERROR("Restoring old state failed with %i\n", ret);
> +		__intel_atomic_commit_tail(restore_state, true);
>  
>  		intel_hpd_init(dev_priv);
>  	}
>  
> -	if (state)
> -		drm_atomic_state_put(state);
> -	drm_modeset_drop_locks(ctx);
> -	drm_modeset_acquire_fini(ctx);
> -	mutex_unlock(&dev->mode_config.mutex);
> +	if (restore_state) {
> +		drm_atomic_helper_clean_committed_state(restore_state);
> +		drm_atomic_state_put(restore_state);
> +	}
> +
> +	up_write(&dev_priv->commit_sem);
>  }
>  
>  static bool abort_flip_on_reset(struct intel_crtc *crtc)
> @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
>  {
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
>  	struct drm_i915_private *dev_priv = to_i915(state->dev);
> -	struct drm_crtc *crtc;
> -	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> -	int ret = 0, i;
> +	int ret = 0;
>  
>  	if (!check_digital_port_conflicts(state)) {
>  		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
>  		return -EINVAL;
>  	}
>  
> -	intel_state->modeset = true;
>  	intel_state->active_crtcs = dev_priv->active_crtcs;
>  	intel_state->cdclk.logical = dev_priv->cdclk.logical;
>  	intel_state->cdclk.actual = dev_priv->cdclk.actual;
>  
> -	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
> -		if (new_crtc_state->active)
> -			intel_state->active_crtcs |= 1 << i;
> -		else
> -			intel_state->active_crtcs &= ~(1 << i);
> -
> -		if (old_crtc_state->active != new_crtc_state->active)
> -			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
> -	}
> +	init_intel_state(intel_state);
>  
>  	/*
>  	 * See if the config requires any additional preparation, e.g.
> @@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
>  	intel_atomic_helper_free_state(dev_priv);
>  }
>  
> -static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
>  {
>  	struct drm_device *dev = state->dev;
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
> @@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	/* Only after disabling all output pipelines that will be changed can we
>  	 * update the the output configuration. */
> -	intel_modeset_update_crtc_state(state);
> +	if (!is_reset)
> +		intel_modeset_update_crtc_state(state);
>  
>  	if (intel_state->modeset) {
> -		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		if (!is_reset) {
> +			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		} else {
> +			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
> +				if (new_crtc_state->enable)
> +					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
> +			}
> +		}
>  
>  		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
>  
> @@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (!intel_can_enable_sagv(state))
>  			intel_disable_sagv(dev_priv);
>  
> -		intel_modeset_verify_disabled(dev, state);
> +		if (!is_reset)
> +			intel_modeset_verify_disabled(dev, state);
>  	}
>  
>  	/* Complete the events for pipes that have now been disabled */
> @@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (put_domains[i])
>  			modeset_put_power_domains(dev_priv, put_domains[i]);
>  
> -		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
> +		if (!is_reset)
> +			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
>  	}
>  
>  	if (intel_state->modeset && intel_can_enable_sagv(state))
> @@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	drm_atomic_helper_wait_for_dependencies(state);
>  
> -	__intel_atomic_commit_tail(state);
> +	down_read(&dev_priv->commit_sem);
> +
> +	__intel_atomic_commit_tail(state, false);
>  
>  	drm_atomic_helper_commit_hw_done(state);
>  
> +	up_read(&dev_priv->commit_sem);
> +
>  	mutex_lock(&dev->struct_mutex);
>  	drm_atomic_helper_cleanup_planes(dev, state);
>  	mutex_unlock(&dev->struct_mutex);
> @@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
>  	INIT_WORK(&dev_priv->atomic_helper.free_work,
>  		  intel_atomic_helper_free_state_worker);
>  
> +	init_rwsem(&dev_priv->commit_sem);
> +
>  	intel_init_quirks(dev);
>  
>  	intel_init_pm(dev_priv);
> -- 
> 2.13.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:25     ` Daniel Vetter
@ 2017-06-30 13:30       ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:30 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> How does this solve the deadlock? Afaiui the deadlock is that the gpu
> reset stopped unconditionally completing all requests before it did
> anything else, including anything with the hw or taking modeset locks.
> 
> This ensured that any outstanding flips (we only had pageflips, no atomic
> ones back then) could complete (although maybe displaying garbage). The
> only thing we had to do was grab the locks (to avoid concurrent modesets)
> and then we could happily nuke the hw (since the flips where lost in the
> CS anyway), and restore it afterwards.
> 
> Then the TDR rewrite came around and broke that, which now makes atomic
> time out waiting for the gpu to complete (because the gpu reset waits for
> the modeset to complete first). Which means if you want to fix this
> without breaking TDR, then you need to cancel the pending atomic commits.
> That seems somewhat what you're attempting here with trying to figure out
> what the latest hw-committed step is (but that function gets it wrong),
> and lots more locking and stuff on top.
> 
> Why exactly can't we just go back to the old world of force-completing
> dead requests on gen4 and earlier? That would be tons simpler imo instead
> of throwing a pile of hacks (which really needs a complete rewrite of the
> atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> earlier for years, going back to that isn't going to hurt anyone.
> 
> Making working gen4 gpu reset contigent on cancellable atomic commits and
> the full commit queue is imo nuts.

And if the GEM folks insist the old behavior can't be restored, then we
just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
in i915_sw_fence. Force-completing all render requests atomic updates
depend upon is imo the simplest solution to this, and we've had a driver
that worked like that for years.

And as long as TDR resubmits the batches the render-glitch will at most be
temporary, but that's totally fine: We're killing the entire display block
in the reset anyway, there's no way the user won't notice _that_ kind of
glitch.

Either way, let's please not over-engineer something for a dead-old
platform when something much, much, much simpler worked for years, and
should keep on working for another few years.
-Daniel

PS: One issue with atomic is that there's the small matter of having to
wait for pending atomic commits to complete before we nuke the display, to
avoid upsetting the display code. We could do that with a dummy commit, or
just having a special wait_for_depencies that waits for all CRTC to
complete their pending atomic updates.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 13:30       ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:30 UTC (permalink / raw)
  To: ville.syrjala; +Cc: Daniel Vetter, intel-gfx, stable, dri-devel

On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> How does this solve the deadlock? Afaiui the deadlock is that the gpu
> reset stopped unconditionally completing all requests before it did
> anything else, including anything with the hw or taking modeset locks.
> 
> This ensured that any outstanding flips (we only had pageflips, no atomic
> ones back then) could complete (although maybe displaying garbage). The
> only thing we had to do was grab the locks (to avoid concurrent modesets)
> and then we could happily nuke the hw (since the flips where lost in the
> CS anyway), and restore it afterwards.
> 
> Then the TDR rewrite came around and broke that, which now makes atomic
> time out waiting for the gpu to complete (because the gpu reset waits for
> the modeset to complete first). Which means if you want to fix this
> without breaking TDR, then you need to cancel the pending atomic commits.
> That seems somewhat what you're attempting here with trying to figure out
> what the latest hw-committed step is (but that function gets it wrong),
> and lots more locking and stuff on top.
> 
> Why exactly can't we just go back to the old world of force-completing
> dead requests on gen4 and earlier? That would be tons simpler imo instead
> of throwing a pile of hacks (which really needs a complete rewrite of the
> atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> earlier for years, going back to that isn't going to hurt anyone.
> 
> Making working gen4 gpu reset contigent on cancellable atomic commits and
> the full commit queue is imo nuts.

And if the GEM folks insist the old behavior can't be restored, then we
just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
in i915_sw_fence. Force-completing all render requests atomic updates
depend upon is imo the simplest solution to this, and we've had a driver
that worked like that for years.

And as long as TDR resubmits the batches the render-glitch will at most be
temporary, but that's totally fine: We're killing the entire display block
in the reset anyway, there's no way the user won't notice _that_ kind of
glitch.

Either way, let's please not over-engineer something for a dead-old
platform when something much, much, much simpler worked for years, and
should keep on working for another few years.
-Daniel

PS: One issue with atomic is that there's the small matter of having to
wait for pending atomic commits to complete before we nuke the display, to
avoid upsetting the display code. We could do that with a dummy commit, or
just having a special wait_for_depencies that waits for all CRTC to
complete their pending atomic updates.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 19:26         ` Ville Syrjälä
@ 2017-06-30 13:35           ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:35 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrj�l� wrote:
> On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > This matches what I thought should be done (I didn't think of using
> > rwsem just a mutex, nice touch). The point I got stuck on was what
> > should be done after the reset? I expected another modeset to return the
> > state back or otherwise the inflight would get confused?
> 
> I guess that can happen. For instance, if we have a crtc_enable() in flight,
> and then we do a reset before it gets committed we would end up doing
> crtc_enable() twice in a row without a crtc_disable in between. For page
> flips and such this shouldn't be a big deal in general.

atomic commits are ordered. You have to wait for the previous ones to
complete before you do a new one. If you don't do that, then all hell
breaks loose.

What you really can't do with atomic (without rewriting everything once
more) is cancel a commit. Pre-atomic we could do that on gen4 since the
mmio flips died with the gpu, but that's the one design change we need to
cope with (plus TDR insisting it can't force-complete requests anymore).

> > > During reset we should be recommiting the state that was committed last.
> > > But for now we'll settle for recommiting the last state for each object.
> > 
> > Ah, I guess that explains the above. What is the complication with
> > restoring the current state as opposed to the next state?
> 
> Well the main thing is just tracking which is the current state. That
> just needs refactoring the .atomic_duplicate_state() calling convention
> across the whole tree so that we can then duplicate the committed state
> rather than the latest state.
> 
> Also due to the commit_hw_done() being potentially done after the
> modeset locks have been dropped, I don't think we can be certain
> of it getting called in the same order as swap_state(), hence
> when we track the committed state in commit_hw_done() we'll have
> to have some way to figure out if our new state is in fact the
> latest committed state for each object or if the calls got
> reordered. We don't insert any dependencies between two commits
> unless they touch the same active crtc, thus this reordering
> seems very much possible. Dunno if we should add some way to add
> such dependeencies whenever the same object is part of two otherwise
> independent commits, or if we just want to try and work with the
> reordered calls. My idea for the latter was some kind of seqno/age
> counter on the object states that allows me to recongnize which
> state is more recent. The object states aren't refcounted so hanging
> on to the wrong pointer could cause an oops the next time we have to
> perform a GPU reset.

Atomic commits are strongly ordered on a given CRTC, so stuff can't be
out-of-order within one. Across them the idea was to just add more CRTC
states into the atomic commit to make sure stuff is ordered correctly.

Laurent just pointed out that for switching planes that doesn't happen
atm, but for i915 we should be safe (I guess I thought too much about i915
when typing the commit tracking code).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 13:35           ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:35 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > This matches what I thought should be done (I didn't think of using
> > rwsem just a mutex, nice touch). The point I got stuck on was what
> > should be done after the reset? I expected another modeset to return the
> > state back or otherwise the inflight would get confused?
> 
> I guess that can happen. For instance, if we have a crtc_enable() in flight,
> and then we do a reset before it gets committed we would end up doing
> crtc_enable() twice in a row without a crtc_disable in between. For page
> flips and such this shouldn't be a big deal in general.

atomic commits are ordered. You have to wait for the previous ones to
complete before you do a new one. If you don't do that, then all hell
breaks loose.

What you really can't do with atomic (without rewriting everything once
more) is cancel a commit. Pre-atomic we could do that on gen4 since the
mmio flips died with the gpu, but that's the one design change we need to
cope with (plus TDR insisting it can't force-complete requests anymore).

> > > During reset we should be recommiting the state that was committed last.
> > > But for now we'll settle for recommiting the last state for each object.
> > 
> > Ah, I guess that explains the above. What is the complication with
> > restoring the current state as opposed to the next state?
> 
> Well the main thing is just tracking which is the current state. That
> just needs refactoring the .atomic_duplicate_state() calling convention
> across the whole tree so that we can then duplicate the committed state
> rather than the latest state.
> 
> Also due to the commit_hw_done() being potentially done after the
> modeset locks have been dropped, I don't think we can be certain
> of it getting called in the same order as swap_state(), hence
> when we track the committed state in commit_hw_done() we'll have
> to have some way to figure out if our new state is in fact the
> latest committed state for each object or if the calls got
> reordered. We don't insert any dependencies between two commits
> unless they touch the same active crtc, thus this reordering
> seems very much possible. Dunno if we should add some way to add
> such dependeencies whenever the same object is part of two otherwise
> independent commits, or if we just want to try and work with the
> reordered calls. My idea for the latter was some kind of seqno/age
> counter on the object states that allows me to recongnize which
> state is more recent. The object states aren't refcounted so hanging
> on to the wrong pointer could cause an oops the next time we have to
> perform a GPU reset.

Atomic commits are strongly ordered on a given CRTC, so stuff can't be
out-of-order within one. Across them the idea was to just add more CRTC
states into the atomic commit to make sure stuff is ordered correctly.

Laurent just pointed out that for switching planes that doesn't happen
atm, but for i915 we should be safe (I guess I thought too much about i915
when typing the commit tracking code).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:35           ` Daniel Vetter
@ 2017-06-30 13:53             ` Ville Syrjälä
  -1 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 13:53 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrj�l� wrote:
> > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > > > 
> > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > will use down_write() making sure no other commits are in progress when
> > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > itself, and we wait for all dependencies before the down_read(), and
> > > > thus there is no chance of deadlocks with this scheme.
> > > 
> > > This matches what I thought should be done (I didn't think of using
> > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > should be done after the reset? I expected another modeset to return the
> > > state back or otherwise the inflight would get confused?
> > 
> > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > and then we do a reset before it gets committed we would end up doing
> > crtc_enable() twice in a row without a crtc_disable in between. For page
> > flips and such this shouldn't be a big deal in general.
> 
> atomic commits are ordered. You have to wait for the previous ones to
> complete before you do a new one. If you don't do that, then all hell
> breaks loose.

What we're effectively doing here is inserting two new commits in
the middle of the timeline, one to disable everything, and another
one to re-enable everything. At the end of the the re-enable we would
want the hardware state should match exactly what was there before
the disable, hence any commits still in the timeline should work
correctly. That is, their old_state will match the hardware state
when it comes time to commit them.

But we can only do that properly after we start to track the committed
state. Without that tracking we can get into trouble wrt. the
hardware state not matching the old state when it comes time to
commit the new state.

> 
> What you really can't do with atomic (without rewriting everything once
> more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> mmio flips died with the gpu, but that's the one design change we need to
> cope with (plus TDR insisting it can't force-complete requests anymore).
> 
> > > > During reset we should be recommiting the state that was committed last.
> > > > But for now we'll settle for recommiting the last state for each object.
> > > 
> > > Ah, I guess that explains the above. What is the complication with
> > > restoring the current state as opposed to the next state?
> > 
> > Well the main thing is just tracking which is the current state. That
> > just needs refactoring the .atomic_duplicate_state() calling convention
> > across the whole tree so that we can then duplicate the committed state
> > rather than the latest state.
> > 
> > Also due to the commit_hw_done() being potentially done after the
> > modeset locks have been dropped, I don't think we can be certain
> > of it getting called in the same order as swap_state(), hence
> > when we track the committed state in commit_hw_done() we'll have
> > to have some way to figure out if our new state is in fact the
> > latest committed state for each object or if the calls got
> > reordered. We don't insert any dependencies between two commits
> > unless they touch the same active crtc, thus this reordering
> > seems very much possible. Dunno if we should add some way to add
> > such dependeencies whenever the same object is part of two otherwise
> > independent commits, or if we just want to try and work with the
> > reordered calls. My idea for the latter was some kind of seqno/age
> > counter on the object states that allows me to recongnize which
> > state is more recent. The object states aren't refcounted so hanging
> > on to the wrong pointer could cause an oops the next time we have to
> > perform a GPU reset.
> 
> Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> out-of-order within one. Across them the idea was to just add more CRTC
> states into the atomic commit to make sure stuff is ordered correctly.

And atomic commits not touching the same crtc will not be ordered in any
way. Thus if they touch the same object (eg. disabled plane or connector)
we can't currently tell if the commit_hw_done() calls happened in the same
order as the swap_state() calls for that particular object.

-- 
Ville Syrj�l�
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 13:53             ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 13:53 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx, stable, dri-devel

On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > 
> > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > will use down_write() making sure no other commits are in progress when
> > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > itself, and we wait for all dependencies before the down_read(), and
> > > > thus there is no chance of deadlocks with this scheme.
> > > 
> > > This matches what I thought should be done (I didn't think of using
> > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > should be done after the reset? I expected another modeset to return the
> > > state back or otherwise the inflight would get confused?
> > 
> > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > and then we do a reset before it gets committed we would end up doing
> > crtc_enable() twice in a row without a crtc_disable in between. For page
> > flips and such this shouldn't be a big deal in general.
> 
> atomic commits are ordered. You have to wait for the previous ones to
> complete before you do a new one. If you don't do that, then all hell
> breaks loose.

What we're effectively doing here is inserting two new commits in
the middle of the timeline, one to disable everything, and another
one to re-enable everything. At the end of the the re-enable we would
want the hardware state should match exactly what was there before
the disable, hence any commits still in the timeline should work
correctly. That is, their old_state will match the hardware state
when it comes time to commit them.

But we can only do that properly after we start to track the committed
state. Without that tracking we can get into trouble wrt. the
hardware state not matching the old state when it comes time to
commit the new state.

> 
> What you really can't do with atomic (without rewriting everything once
> more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> mmio flips died with the gpu, but that's the one design change we need to
> cope with (plus TDR insisting it can't force-complete requests anymore).
> 
> > > > During reset we should be recommiting the state that was committed last.
> > > > But for now we'll settle for recommiting the last state for each object.
> > > 
> > > Ah, I guess that explains the above. What is the complication with
> > > restoring the current state as opposed to the next state?
> > 
> > Well the main thing is just tracking which is the current state. That
> > just needs refactoring the .atomic_duplicate_state() calling convention
> > across the whole tree so that we can then duplicate the committed state
> > rather than the latest state.
> > 
> > Also due to the commit_hw_done() being potentially done after the
> > modeset locks have been dropped, I don't think we can be certain
> > of it getting called in the same order as swap_state(), hence
> > when we track the committed state in commit_hw_done() we'll have
> > to have some way to figure out if our new state is in fact the
> > latest committed state for each object or if the calls got
> > reordered. We don't insert any dependencies between two commits
> > unless they touch the same active crtc, thus this reordering
> > seems very much possible. Dunno if we should add some way to add
> > such dependeencies whenever the same object is part of two otherwise
> > independent commits, or if we just want to try and work with the
> > reordered calls. My idea for the latter was some kind of seqno/age
> > counter on the object states that allows me to recongnize which
> > state is more recent. The object states aren't refcounted so hanging
> > on to the wrong pointer could cause an oops the next time we have to
> > perform a GPU reset.
> 
> Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> out-of-order within one. Across them the idea was to just add more CRTC
> states into the atomic commit to make sure stuff is ordered correctly.

And atomic commits not touching the same crtc will not be ordered in any
way. Thus if they touch the same object (eg. disabled plane or connector)
we can't currently tell if the commit_hw_done() calls happened in the same
order as the swap_state() calls for that particular object.

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:53             ` Ville Syrjälä
@ 2017-06-30 15:30               ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 15:30 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrj�l� wrote:
> On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrj�l� wrote:
> > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > > > > 
> > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > will use down_write() making sure no other commits are in progress when
> > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > thus there is no chance of deadlocks with this scheme.
> > > > 
> > > > This matches what I thought should be done (I didn't think of using
> > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > should be done after the reset? I expected another modeset to return the
> > > > state back or otherwise the inflight would get confused?
> > > 
> > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > and then we do a reset before it gets committed we would end up doing
> > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > flips and such this shouldn't be a big deal in general.
> > 
> > atomic commits are ordered. You have to wait for the previous ones to
> > complete before you do a new one. If you don't do that, then all hell
> > breaks loose.
> 
> What we're effectively doing here is inserting two new commits in
> the middle of the timeline, one to disable everything, and another
> one to re-enable everything. At the end of the the re-enable we would
> want the hardware state should match exactly what was there before
> the disable, hence any commits still in the timeline should work
> correctly. That is, their old_state will match the hardware state
> when it comes time to commit them.
> 
> But we can only do that properly after we start to track the committed
> state. Without that tracking we can get into trouble wrt. the
> hardware state not matching the old state when it comes time to
> commit the new state.

Yeah, but my point is that this here is an extremely fancy and fragile
(and afaics in this form, incomplete) fix for something that in the past
was fixed much, much simpler. Why do we need this massive amount of
complexity now? Who's asking for all this (we don't even have anyone yet
asking for fully queued atomic commits, much less on gen4)?

I mean rewriting the entire modeset code is fun and all, but for gen3-4?

> > What you really can't do with atomic (without rewriting everything once
> > more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> > mmio flips died with the gpu, but that's the one design change we need to
> > cope with (plus TDR insisting it can't force-complete requests anymore).
> > 
> > > > > During reset we should be recommiting the state that was committed last.
> > > > > But for now we'll settle for recommiting the last state for each object.
> > > > 
> > > > Ah, I guess that explains the above. What is the complication with
> > > > restoring the current state as opposed to the next state?
> > > 
> > > Well the main thing is just tracking which is the current state. That
> > > just needs refactoring the .atomic_duplicate_state() calling convention
> > > across the whole tree so that we can then duplicate the committed state
> > > rather than the latest state.
> > > 
> > > Also due to the commit_hw_done() being potentially done after the
> > > modeset locks have been dropped, I don't think we can be certain
> > > of it getting called in the same order as swap_state(), hence
> > > when we track the committed state in commit_hw_done() we'll have
> > > to have some way to figure out if our new state is in fact the
> > > latest committed state for each object or if the calls got
> > > reordered. We don't insert any dependencies between two commits
> > > unless they touch the same active crtc, thus this reordering
> > > seems very much possible. Dunno if we should add some way to add
> > > such dependeencies whenever the same object is part of two otherwise
> > > independent commits, or if we just want to try and work with the
> > > reordered calls. My idea for the latter was some kind of seqno/age
> > > counter on the object states that allows me to recongnize which
> > > state is more recent. The object states aren't refcounted so hanging
> > > on to the wrong pointer could cause an oops the next time we have to
> > > perform a GPU reset.
> > 
> > Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> > out-of-order within one. Across them the idea was to just add more CRTC
> > states into the atomic commit to make sure stuff is ordered correctly.
> 
> And atomic commits not touching the same crtc will not be ordered in any
> way. Thus if they touch the same object (eg. disabled plane or connector)
> we can't currently tell if the commit_hw_done() calls happened in the same
> order as the swap_state() calls for that particular object.

Yes, but I think for i915 it's ok, because we don't have planes that move
around (not supported at least), and for other shared stuff we just grab
them all. It is a bug in general though, and for i915 it would probably be
best if we'd add the various drm_crtc_commit waits to the i915_sw_fence.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 15:30               ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 15:30 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
> On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > 
> > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > will use down_write() making sure no other commits are in progress when
> > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > thus there is no chance of deadlocks with this scheme.
> > > > 
> > > > This matches what I thought should be done (I didn't think of using
> > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > should be done after the reset? I expected another modeset to return the
> > > > state back or otherwise the inflight would get confused?
> > > 
> > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > and then we do a reset before it gets committed we would end up doing
> > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > flips and such this shouldn't be a big deal in general.
> > 
> > atomic commits are ordered. You have to wait for the previous ones to
> > complete before you do a new one. If you don't do that, then all hell
> > breaks loose.
> 
> What we're effectively doing here is inserting two new commits in
> the middle of the timeline, one to disable everything, and another
> one to re-enable everything. At the end of the the re-enable we would
> want the hardware state should match exactly what was there before
> the disable, hence any commits still in the timeline should work
> correctly. That is, their old_state will match the hardware state
> when it comes time to commit them.
> 
> But we can only do that properly after we start to track the committed
> state. Without that tracking we can get into trouble wrt. the
> hardware state not matching the old state when it comes time to
> commit the new state.

Yeah, but my point is that this here is an extremely fancy and fragile
(and afaics in this form, incomplete) fix for something that in the past
was fixed much, much simpler. Why do we need this massive amount of
complexity now? Who's asking for all this (we don't even have anyone yet
asking for fully queued atomic commits, much less on gen4)?

I mean rewriting the entire modeset code is fun and all, but for gen3-4?

> > What you really can't do with atomic (without rewriting everything once
> > more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> > mmio flips died with the gpu, but that's the one design change we need to
> > cope with (plus TDR insisting it can't force-complete requests anymore).
> > 
> > > > > During reset we should be recommiting the state that was committed last.
> > > > > But for now we'll settle for recommiting the last state for each object.
> > > > 
> > > > Ah, I guess that explains the above. What is the complication with
> > > > restoring the current state as opposed to the next state?
> > > 
> > > Well the main thing is just tracking which is the current state. That
> > > just needs refactoring the .atomic_duplicate_state() calling convention
> > > across the whole tree so that we can then duplicate the committed state
> > > rather than the latest state.
> > > 
> > > Also due to the commit_hw_done() being potentially done after the
> > > modeset locks have been dropped, I don't think we can be certain
> > > of it getting called in the same order as swap_state(), hence
> > > when we track the committed state in commit_hw_done() we'll have
> > > to have some way to figure out if our new state is in fact the
> > > latest committed state for each object or if the calls got
> > > reordered. We don't insert any dependencies between two commits
> > > unless they touch the same active crtc, thus this reordering
> > > seems very much possible. Dunno if we should add some way to add
> > > such dependeencies whenever the same object is part of two otherwise
> > > independent commits, or if we just want to try and work with the
> > > reordered calls. My idea for the latter was some kind of seqno/age
> > > counter on the object states that allows me to recongnize which
> > > state is more recent. The object states aren't refcounted so hanging
> > > on to the wrong pointer could cause an oops the next time we have to
> > > perform a GPU reset.
> > 
> > Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> > out-of-order within one. Across them the idea was to just add more CRTC
> > states into the atomic commit to make sure stuff is ordered correctly.
> 
> And atomic commits not touching the same crtc will not be ordered in any
> way. Thus if they touch the same object (eg. disabled plane or connector)
> we can't currently tell if the commit_hw_done() calls happened in the same
> order as the swap_state() calls for that particular object.

Yes, but I think for i915 it's ok, because we don't have planes that move
around (not supported at least), and for other shared stuff we just grab
them all. It is a bug in general though, and for i915 it would probably be
best if we'd add the various drm_crtc_commit waits to the i915_sw_fence.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:30               ` Daniel Vetter
@ 2017-06-30 15:39                 ` Chris Wilson
  -1 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2017-06-30 15:39 UTC (permalink / raw)
  To: Daniel Vetter, Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Daniel Vetter

Quoting Daniel Vetter (2017-06-30 16:30:59)
> On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
> > On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> > > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > 
> > > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > > will use down_write() making sure no other commits are in progress when
> > > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > > thus there is no chance of deadlocks with this scheme.
> > > > > 
> > > > > This matches what I thought should be done (I didn't think of using
> > > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > > should be done after the reset? I expected another modeset to return the
> > > > > state back or otherwise the inflight would get confused?
> > > > 
> > > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > > and then we do a reset before it gets committed we would end up doing
> > > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > > flips and such this shouldn't be a big deal in general.
> > > 
> > > atomic commits are ordered. You have to wait for the previous ones to
> > > complete before you do a new one. If you don't do that, then all hell
> > > breaks loose.
> > 
> > What we're effectively doing here is inserting two new commits in
> > the middle of the timeline, one to disable everything, and another
> > one to re-enable everything. At the end of the the re-enable we would
> > want the hardware state should match exactly what was there before
> > the disable, hence any commits still in the timeline should work
> > correctly. That is, their old_state will match the hardware state
> > when it comes time to commit them.
> > 
> > But we can only do that properly after we start to track the committed
> > state. Without that tracking we can get into trouble wrt. the
> > hardware state not matching the old state when it comes time to
> > commit the new state.
> 
> Yeah, but my point is that this here is an extremely fancy and fragile
> (and afaics in this form, incomplete) fix for something that in the past
> was fixed much, much simpler. Why do we need this massive amount of
> complexity now? Who's asking for all this (we don't even have anyone yet
> asking for fully queued atomic commits, much less on gen4)?

It was never "fixed", it was always borked; broken by design.
-Chris

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 15:39                 ` Chris Wilson
  0 siblings, 0 replies; 44+ messages in thread
From: Chris Wilson @ 2017-06-30 15:39 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Daniel Vetter

Quoting Daniel Vetter (2017-06-30 16:30:59)
> On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
> > On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> > > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > 
> > > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > > will use down_write() making sure no other commits are in progress when
> > > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > > thus there is no chance of deadlocks with this scheme.
> > > > > 
> > > > > This matches what I thought should be done (I didn't think of using
> > > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > > should be done after the reset? I expected another modeset to return the
> > > > > state back or otherwise the inflight would get confused?
> > > > 
> > > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > > and then we do a reset before it gets committed we would end up doing
> > > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > > flips and such this shouldn't be a big deal in general.
> > > 
> > > atomic commits are ordered. You have to wait for the previous ones to
> > > complete before you do a new one. If you don't do that, then all hell
> > > breaks loose.
> > 
> > What we're effectively doing here is inserting two new commits in
> > the middle of the timeline, one to disable everything, and another
> > one to re-enable everything. At the end of the the re-enable we would
> > want the hardware state should match exactly what was there before
> > the disable, hence any commits still in the timeline should work
> > correctly. That is, their old_state will match the hardware state
> > when it comes time to commit them.
> > 
> > But we can only do that properly after we start to track the committed
> > state. Without that tracking we can get into trouble wrt. the
> > hardware state not matching the old state when it comes time to
> > commit the new state.
> 
> Yeah, but my point is that this here is an extremely fancy and fragile
> (and afaics in this form, incomplete) fix for something that in the past
> was fixed much, much simpler. Why do we need this massive amount of
> complexity now? Who's asking for all this (we don't even have anyone yet
> asking for fully queued atomic commits, much less on gen4)?

It was never "fixed", it was always borked; broken by design.
-Chris

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:30       ` Daniel Vetter
@ 2017-06-30 15:44         ` Ville Syrjälä
  -1 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 15:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > > From: Ville Syrj�l� <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > How does this solve the deadlock? Afaiui the deadlock is that the gpu
> > reset stopped unconditionally completing all requests before it did
> > anything else, including anything with the hw or taking modeset locks.
> > 
> > This ensured that any outstanding flips (we only had pageflips, no atomic
> > ones back then) could complete (although maybe displaying garbage). The
> > only thing we had to do was grab the locks (to avoid concurrent modesets)
> > and then we could happily nuke the hw (since the flips where lost in the
> > CS anyway), and restore it afterwards.
> > 
> > Then the TDR rewrite came around and broke that, which now makes atomic
> > time out waiting for the gpu to complete (because the gpu reset waits for
> > the modeset to complete first). Which means if you want to fix this
> > without breaking TDR, then you need to cancel the pending atomic commits.
> > That seems somewhat what you're attempting here with trying to figure out
> > what the latest hw-committed step is (but that function gets it wrong),
> > and lots more locking and stuff on top.
> > 
> > Why exactly can't we just go back to the old world of force-completing
> > dead requests on gen4 and earlier? That would be tons simpler imo instead
> > of throwing a pile of hacks (which really needs a complete rewrite of the
> > atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> > earlier for years, going back to that isn't going to hurt anyone.
> > 
> > Making working gen4 gpu reset contigent on cancellable atomic commits and
> > the full commit queue is imo nuts.
> 
> And if the GEM folks insist the old behavior can't be restored, then we
> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> in i915_sw_fence. Force-completing all render requests atomic updates
> depend upon is imo the simplest solution to this, and we've had a driver
> that worked like that for years.

And it used to break all the time. I think we've had to fix it at least
three times by now. So I tend to think it's better to fix it in a way
that won't break so easily.

-- 
Ville Syrj�l�
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 15:44         ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 15:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx, stable, dri-devel

On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > How does this solve the deadlock? Afaiui the deadlock is that the gpu
> > reset stopped unconditionally completing all requests before it did
> > anything else, including anything with the hw or taking modeset locks.
> > 
> > This ensured that any outstanding flips (we only had pageflips, no atomic
> > ones back then) could complete (although maybe displaying garbage). The
> > only thing we had to do was grab the locks (to avoid concurrent modesets)
> > and then we could happily nuke the hw (since the flips where lost in the
> > CS anyway), and restore it afterwards.
> > 
> > Then the TDR rewrite came around and broke that, which now makes atomic
> > time out waiting for the gpu to complete (because the gpu reset waits for
> > the modeset to complete first). Which means if you want to fix this
> > without breaking TDR, then you need to cancel the pending atomic commits.
> > That seems somewhat what you're attempting here with trying to figure out
> > what the latest hw-committed step is (but that function gets it wrong),
> > and lots more locking and stuff on top.
> > 
> > Why exactly can't we just go back to the old world of force-completing
> > dead requests on gen4 and earlier? That would be tons simpler imo instead
> > of throwing a pile of hacks (which really needs a complete rewrite of the
> > atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> > earlier for years, going back to that isn't going to hurt anyone.
> > 
> > Making working gen4 gpu reset contigent on cancellable atomic commits and
> > the full commit queue is imo nuts.
> 
> And if the GEM folks insist the old behavior can't be restored, then we
> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> in i915_sw_fence. Force-completing all render requests atomic updates
> depend upon is imo the simplest solution to this, and we've had a driver
> that worked like that for years.

And it used to break all the time. I think we've had to fix it at least
three times by now. So I tend to think it's better to fix it in a way
that won't break so easily.

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:44         ` Ville Syrjälä
  (?)
@ 2017-06-30 18:23         ` Daniel Vetter
  2017-06-30 18:46             ` Ville Syrjälä
  -1 siblings, 1 reply; 44+ messages in thread
From: Daniel Vetter @ 2017-06-30 18:23 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
>> And if the GEM folks insist the old behavior can't be restored, then we
>> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
>> in i915_sw_fence. Force-completing all render requests atomic updates
>> depend upon is imo the simplest solution to this, and we've had a driver
>> that worked like that for years.
>
> And it used to break all the time. I think we've had to fix it at least
> three times by now. So I tend to think it's better to fix it in a way
> that won't break so easily.

Why exactly is making the atomic code massive more tricky the easy
fix? That's the part I don't get. Yes it got broken a bunch because no
one runs CI and everyone forgets that gen3/4 reset the display in gpu
reset, but in the end we do have a depency loop, and either the
modeset side or the render side needs to bail out and cancel it's
async stuff (whether that's a request or a nonblocking flip/atomic
commit doesn't matter). In my opinion, cancelling the request (even if
we're clever and only cancel the requests for the modeset waiters,
which isn't to hard to pull off) seems about the simplest option.
Especially since we need that code anyway, even TDR can't safe
everything and resubmit under all circumstances (at least the buggy
batch can't be resubmitted).

Cancelling any kind of atomic commit otoh looks like a lot more
complexity. Why do you think this is the easier, or at least less
fragile option? This patch series is full of FIXMEs, and even the more
complete set seems to have a pile of holes. Plus we need to stop using
obj->state, and I don't see any easy way to test for that (since the
gen3/4 gpu reset case is the only corner cases that currently needs
that).

So not seeing how this is easier or more robust at all. What do I miss?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 18:23         ` Daniel Vetter
@ 2017-06-30 18:46             ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 18:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrj�l�
> <ville.syrjala@linux.intel.com> wrote:
> >> And if the GEM folks insist the old behavior can't be restored, then we
> >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> >> in i915_sw_fence. Force-completing all render requests atomic updates
> >> depend upon is imo the simplest solution to this, and we've had a driver
> >> that worked like that for years.
> >
> > And it used to break all the time. I think we've had to fix it at least
> > three times by now. So I tend to think it's better to fix it in a way
> > that won't break so easily.
> 
> Why exactly is making the atomic code massive more tricky the easy
> fix?

I don't see what this massive trickyness is. Compared to the rest of
atomic what I have is absolutely trivial. Just the
duplicate_committed_state() and the '.committed_state = foo'
assignments in hw_done(). That's it really.

> That's the part I don't get. Yes it got broken a bunch because no
> one runs CI and everyone forgets that gen3/4 reset the display in gpu
> reset, but in the end we do have a depency loop, and either the
> modeset side or the render side needs to bail out and cancel it's
> async stuff (whether that's a request or a nonblocking flip/atomic
> commit doesn't matter). In my opinion, cancelling the request (even if
> we're clever and only cancel the requests for the modeset waiters,
> which isn't to hard to pull off) seems about the simplest option.
> Especially since we need that code anyway, even TDR can't safe
> everything and resubmit under all circumstances (at least the buggy
> batch can't be resubmitted).
> 
> Cancelling any kind of atomic commit otoh looks like a lot more
> complexity.

I'm not cancelling anything.

> Why do you think this is the easier, or at least less
> fragile option? This patch series is full of FIXMEs, and even the more
> complete set seems to have a pile of holes. Plus we need to stop using
> obj->state, and I don't see any easy way to test for that (since the
> gen3/4 gpu reset case is the only corner cases that currently needs
> that).

We need to fix that stuff anyway if we ever want to queue up multiple
commits for the same crtc. The stuff I have that is specific to this
reset stuff is actually very simple. The rest is just fixing up the
huge mess we've already made.

-- 
Ville Syrj�l�
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-06-30 18:46             ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-06-30 18:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
> >> And if the GEM folks insist the old behavior can't be restored, then we
> >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> >> in i915_sw_fence. Force-completing all render requests atomic updates
> >> depend upon is imo the simplest solution to this, and we've had a driver
> >> that worked like that for years.
> >
> > And it used to break all the time. I think we've had to fix it at least
> > three times by now. So I tend to think it's better to fix it in a way
> > that won't break so easily.
> 
> Why exactly is making the atomic code massive more tricky the easy
> fix?

I don't see what this massive trickyness is. Compared to the rest of
atomic what I have is absolutely trivial. Just the
duplicate_committed_state() and the '.committed_state = foo'
assignments in hw_done(). That's it really.

> That's the part I don't get. Yes it got broken a bunch because no
> one runs CI and everyone forgets that gen3/4 reset the display in gpu
> reset, but in the end we do have a depency loop, and either the
> modeset side or the render side needs to bail out and cancel it's
> async stuff (whether that's a request or a nonblocking flip/atomic
> commit doesn't matter). In my opinion, cancelling the request (even if
> we're clever and only cancel the requests for the modeset waiters,
> which isn't to hard to pull off) seems about the simplest option.
> Especially since we need that code anyway, even TDR can't safe
> everything and resubmit under all circumstances (at least the buggy
> batch can't be resubmitted).
> 
> Cancelling any kind of atomic commit otoh looks like a lot more
> complexity.

I'm not cancelling anything.

> Why do you think this is the easier, or at least less
> fragile option? This patch series is full of FIXMEs, and even the more
> complete set seems to have a pile of holes. Plus we need to stop using
> obj->state, and I don't see any easy way to test for that (since the
> gen3/4 gpu reset case is the only corner cases that currently needs
> that).

We need to fix that stuff anyway if we ever want to queue up multiple
commits for the same crtc. The stuff I have that is specific to this
reset stuff is actually very simple. The rest is just fixing up the
huge mess we've already made.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 18:46             ` Ville Syrjälä
@ 2017-07-03  7:55               ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03  7:55 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrj�l� wrote:
> On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrj�l�
> > <ville.syrjala@linux.intel.com> wrote:
> > >> And if the GEM folks insist the old behavior can't be restored, then we
> > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > >> depend upon is imo the simplest solution to this, and we've had a driver
> > >> that worked like that for years.
> > >
> > > And it used to break all the time. I think we've had to fix it at least
> > > three times by now. So I tend to think it's better to fix it in a way
> > > that won't break so easily.
> > 
> > Why exactly is making the atomic code massive more tricky the easy
> > fix?
> 
> I don't see what this massive trickyness is. Compared to the rest of
> atomic what I have is absolutely trivial. Just the
> duplicate_committed_state() and the '.committed_state = foo'
> assignments in hw_done(). That's it really.

>From a quick look and your description it seems full of races. I'm not
sure it'll still be simple once those are fixed.

> > That's the part I don't get. Yes it got broken a bunch because no
> > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > reset, but in the end we do have a depency loop, and either the
> > modeset side or the render side needs to bail out and cancel it's
> > async stuff (whether that's a request or a nonblocking flip/atomic
> > commit doesn't matter). In my opinion, cancelling the request (even if
> > we're clever and only cancel the requests for the modeset waiters,
> > which isn't to hard to pull off) seems about the simplest option.
> > Especially since we need that code anyway, even TDR can't safe
> > everything and resubmit under all circumstances (at least the buggy
> > batch can't be resubmitted).
> > 
> > Cancelling any kind of atomic commit otoh looks like a lot more
> > complexity.
> 
> I'm not cancelling anything.

Well by overtaking the in-flight commit you are at least fighting with
that. Either you need to cancel that one, or insert the gpu reset commit
at the right point (and with the right state). Current code drops that and
instead seems to just hope it doesn't lead to tears too much.

> > Why do you think this is the easier, or at least less
> > fragile option? This patch series is full of FIXMEs, and even the more
> > complete set seems to have a pile of holes. Plus we need to stop using
> > obj->state, and I don't see any easy way to test for that (since the
> > gen3/4 gpu reset case is the only corner cases that currently needs
> > that).
> 
> We need to fix that stuff anyway if we ever want to queue up multiple
> commits for the same crtc. The stuff I have that is specific to this
> reset stuff is actually very simple. The rest is just fixing up the
> huge mess we've already made.

Rewriting the world for a regression fix seems a bit much is all I'm
saying. And I'm not sure your approach works without that "rewrite the
world" step. Defacto what your current patches seem to result in is
- we commit the final sw state in gpu reset
- before we resubmit the rendering

That's much easier to pull of by simply force-completing all
i915_sw_fences before we take any modeset locks in the gpu reset path.
Note that we don't need to force-complete any i915_gem_request, we can
just force-complete the i915_sw_fences the work item is blocked on. Needs
some care to avoid races with a new atomic commit (since we need to
force-complete before we grab locks one might sneak in), but that's a
standard pattern.

Plus we then need to wait for all outstanding nonblocking commits once we
do have all modeset locks, since with atomic holding the locks only syncs
against synchronous hw commits (i.e do a fake synchronous commit before we
nuke the display and we're good). A variant of wait_for_dependencies that
waits for all crtc instead of just the crtc in an atomic commit would do I
think.

None of this requires any of the prep work we need the fancy additional
atomic stuff you plan to do. Which I think is really good for a regression
fix.

So again, why do we need to rewrite the world (since these patches here
seem to just be the racy poc) to fix reset on gen3/4?

I know you want to do all this, but tangling up a regression fix in a
rewrite isn't a good idea in my opinion. I'm not against your long-term
plans, I just think it'd be good if we can have them as orthogonal pieces
if feasible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-07-03  7:55               ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03  7:55 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
> On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
> > <ville.syrjala@linux.intel.com> wrote:
> > >> And if the GEM folks insist the old behavior can't be restored, then we
> > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > >> depend upon is imo the simplest solution to this, and we've had a driver
> > >> that worked like that for years.
> > >
> > > And it used to break all the time. I think we've had to fix it at least
> > > three times by now. So I tend to think it's better to fix it in a way
> > > that won't break so easily.
> > 
> > Why exactly is making the atomic code massive more tricky the easy
> > fix?
> 
> I don't see what this massive trickyness is. Compared to the rest of
> atomic what I have is absolutely trivial. Just the
> duplicate_committed_state() and the '.committed_state = foo'
> assignments in hw_done(). That's it really.

>From a quick look and your description it seems full of races. I'm not
sure it'll still be simple once those are fixed.

> > That's the part I don't get. Yes it got broken a bunch because no
> > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > reset, but in the end we do have a depency loop, and either the
> > modeset side or the render side needs to bail out and cancel it's
> > async stuff (whether that's a request or a nonblocking flip/atomic
> > commit doesn't matter). In my opinion, cancelling the request (even if
> > we're clever and only cancel the requests for the modeset waiters,
> > which isn't to hard to pull off) seems about the simplest option.
> > Especially since we need that code anyway, even TDR can't safe
> > everything and resubmit under all circumstances (at least the buggy
> > batch can't be resubmitted).
> > 
> > Cancelling any kind of atomic commit otoh looks like a lot more
> > complexity.
> 
> I'm not cancelling anything.

Well by overtaking the in-flight commit you are at least fighting with
that. Either you need to cancel that one, or insert the gpu reset commit
at the right point (and with the right state). Current code drops that and
instead seems to just hope it doesn't lead to tears too much.

> > Why do you think this is the easier, or at least less
> > fragile option? This patch series is full of FIXMEs, and even the more
> > complete set seems to have a pile of holes. Plus we need to stop using
> > obj->state, and I don't see any easy way to test for that (since the
> > gen3/4 gpu reset case is the only corner cases that currently needs
> > that).
> 
> We need to fix that stuff anyway if we ever want to queue up multiple
> commits for the same crtc. The stuff I have that is specific to this
> reset stuff is actually very simple. The rest is just fixing up the
> huge mess we've already made.

Rewriting the world for a regression fix seems a bit much is all I'm
saying. And I'm not sure your approach works without that "rewrite the
world" step. Defacto what your current patches seem to result in is
- we commit the final sw state in gpu reset
- before we resubmit the rendering

That's much easier to pull of by simply force-completing all
i915_sw_fences before we take any modeset locks in the gpu reset path.
Note that we don't need to force-complete any i915_gem_request, we can
just force-complete the i915_sw_fences the work item is blocked on. Needs
some care to avoid races with a new atomic commit (since we need to
force-complete before we grab locks one might sneak in), but that's a
standard pattern.

Plus we then need to wait for all outstanding nonblocking commits once we
do have all modeset locks, since with atomic holding the locks only syncs
against synchronous hw commits (i.e do a fake synchronous commit before we
nuke the display and we're good). A variant of wait_for_dependencies that
waits for all crtc instead of just the crtc in an atomic commit would do I
think.

None of this requires any of the prep work we need the fancy additional
atomic stuff you plan to do. Which I think is really good for a regression
fix.

So again, why do we need to rewrite the world (since these patches here
seem to just be the racy poc) to fix reset on gen3/4?

I know you want to do all this, but tangling up a regression fix in a
rewrite isn't a good idea in my opinion. I'm not against your long-term
plans, I just think it'd be good if we can have them as orthogonal pieces
if feasible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:39                 ` Chris Wilson
@ 2017-07-03  8:03                   ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03  8:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Ville Syrjälä, intel-gfx, dri-devel, stable

On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Yeah, but my point is that this here is an extremely fancy and fragile
>> (and afaics in this form, incomplete) fix for something that in the past
>> was fixed much, much simpler. Why do we need this massive amount of
>> complexity now? Who's asking for all this (we don't even have anyone yet
>> asking for fully queued atomic commits, much less on gen4)?
>
> It was never "fixed", it was always borked; broken by design.

Hm, what was broken by design in gen3/4 reset? We never bothered to
resubmit rendering when the gpu died, but besides that I'm not aware
of a deisgn issue in that logic. We nuked in-flight pageflips (and
restored those), and we stalled for any pending modesets (grabbing
locks did that since all modesets where blocking), and that made sure
the hw was in a consistent state.

We always leaked the vblank state to userspace, but this approach here
also doesn't fix this. Plus broken rendering, but for these old
platforms I'm not too worried about displaying a few wrong frames
(with the new reset we will resubmit, so proper rendering should show
up soonish) - after all gpu reset nukes the entire display, there's no
way for the user to not notice that.

It would be neat to not have to do that, and Ville has a plan, but
meanwhile we still have this regression at hand that seems to be the
blocker for adding more machines to CI. I'd like to have the least
complex path to get that address (but maybe not long-term fixed, I'm
clear on that). If feasible.

If that's a unicorn, then let's go with Ville's approach, but then I
think we need the full thing with the races properly closed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-07-03  8:03                   ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03  8:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: stable, intel-gfx, dri-devel

On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Yeah, but my point is that this here is an extremely fancy and fragile
>> (and afaics in this form, incomplete) fix for something that in the past
>> was fixed much, much simpler. Why do we need this massive amount of
>> complexity now? Who's asking for all this (we don't even have anyone yet
>> asking for fully queued atomic commits, much less on gen4)?
>
> It was never "fixed", it was always borked; broken by design.

Hm, what was broken by design in gen3/4 reset? We never bothered to
resubmit rendering when the gpu died, but besides that I'm not aware
of a deisgn issue in that logic. We nuked in-flight pageflips (and
restored those), and we stalled for any pending modesets (grabbing
locks did that since all modesets where blocking), and that made sure
the hw was in a consistent state.

We always leaked the vblank state to userspace, but this approach here
also doesn't fix this. Plus broken rendering, but for these old
platforms I'm not too worried about displaying a few wrong frames
(with the new reset we will resubmit, so proper rendering should show
up soonish) - after all gpu reset nukes the entire display, there's no
way for the user to not notice that.

It would be neat to not have to do that, and Ville has a plan, but
meanwhile we still have this regression at hand that seems to be the
blocker for adding more machines to CI. I'd like to have the least
complex path to get that address (but maybe not long-term fixed, I'm
clear on that). If feasible.

If that's a unicorn, then let's go with Ville's approach, but then I
think we need the full thing with the races properly closed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  8:03                   ` Daniel Vetter
  (?)
@ 2017-07-03  8:16                   ` Chris Wilson
  2017-07-03 16:42                       ` Daniel Vetter
  -1 siblings, 1 reply; 44+ messages in thread
From: Chris Wilson @ 2017-07-03  8:16 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: stable, intel-gfx, dri-devel

Quoting Daniel Vetter (2017-07-03 09:03:36)
> On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> Yeah, but my point is that this here is an extremely fancy and fragile
> >> (and afaics in this form, incomplete) fix for something that in the past
> >> was fixed much, much simpler. Why do we need this massive amount of
> >> complexity now? Who's asking for all this (we don't even have anyone yet
> >> asking for fully queued atomic commits, much less on gen4)?
> >
> > It was never "fixed", it was always borked; broken by design.
> 
> Hm, what was broken by design in gen3/4 reset? We never bothered to
> resubmit rendering when the gpu died, but besides that I'm not aware
> of a deisgn issue in that logic. We nuked in-flight pageflips (and
> restored those), and we stalled for any pending modesets (grabbing
> locks did that since all modesets where blocking), and that made sure
> the hw was in a consistent state.

KMS reset was taking mutexes within reset without any means of breaking the
inherent deadlock, instead relying on something else to magically fix
it. We only ever engineered around struct_mutex for reset, the remains
of the deadlock upon struct_mutex within modeset is an issue but not the
one causing trouble here.

Please do note the bugs that indicate that even without modeset reset,
hw is not in a consistent state.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  7:55               ` Daniel Vetter
@ 2017-07-03  9:30                 ` Ville Syrjälä
  -1 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-07-03  9:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 09:55:48AM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrj�l� wrote:
> > On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrj�l�
> > > <ville.syrjala@linux.intel.com> wrote:
> > > >> And if the GEM folks insist the old behavior can't be restored, then we
> > > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > > >> depend upon is imo the simplest solution to this, and we've had a driver
> > > >> that worked like that for years.
> > > >
> > > > And it used to break all the time. I think we've had to fix it at least
> > > > three times by now. So I tend to think it's better to fix it in a way
> > > > that won't break so easily.
> > > 
> > > Why exactly is making the atomic code massive more tricky the easy
> > > fix?
> > 
> > I don't see what this massive trickyness is. Compared to the rest of
> > atomic what I have is absolutely trivial. Just the
> > duplicate_committed_state() and the '.committed_state = foo'
> > assignments in hw_done(). That's it really.
> 
> >From a quick look and your description it seems full of races.

This "simple" version has problems, yes. The full versions has just
the one potential race between swap_state() and hw_done(). That
seems like pretty easy to handle.

> I'm not
> sure it'll still be simple once those are fixed.
> 
> > > That's the part I don't get. Yes it got broken a bunch because no
> > > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > > reset, but in the end we do have a depency loop, and either the
> > > modeset side or the render side needs to bail out and cancel it's
> > > async stuff (whether that's a request or a nonblocking flip/atomic
> > > commit doesn't matter). In my opinion, cancelling the request (even if
> > > we're clever and only cancel the requests for the modeset waiters,
> > > which isn't to hard to pull off) seems about the simplest option.
> > > Especially since we need that code anyway, even TDR can't safe
> > > everything and resubmit under all circumstances (at least the buggy
> > > batch can't be resubmitted).
> > > 
> > > Cancelling any kind of atomic commit otoh looks like a lot more
> > > complexity.
> > 
> > I'm not cancelling anything.
> 
> Well by overtaking the in-flight commit you are at least fighting with
> that. Either you need to cancel that one, or insert the gpu reset commit
> at the right point (and with the right state). Current code drops that and
> instead seems to just hope it doesn't lead to tears too much.

The simple version is pretty much "cross our fingers and hope for
the best" type of thing, yes. The full version I cooked up earlier
doesn't rely on hope.

In fact I was able to come up with a testcase that does make the simple
version explode, whereas the full version works just fine. So we should
abandon the idea of using this simple version actually.

> 
> > > Why do you think this is the easier, or at least less
> > > fragile option? This patch series is full of FIXMEs, and even the more
> > > complete set seems to have a pile of holes. Plus we need to stop using
> > > obj->state, and I don't see any easy way to test for that (since the
> > > gen3/4 gpu reset case is the only corner cases that currently needs
> > > that).
> > 
> > We need to fix that stuff anyway if we ever want to queue up multiple
> > commits for the same crtc. The stuff I have that is specific to this
> > reset stuff is actually very simple. The rest is just fixing up the
> > huge mess we've already made.
> 
> Rewriting the world for a regression fix seems a bit much is all I'm
> saying. And I'm not sure your approach works without that "rewrite the
> world" step. Defacto what your current patches seem to result in is
> - we commit the final sw state in gpu reset
> - before we resubmit the rendering

That's true. And we do appear to need the refactoring to make it
actually work. Most of the refactoring amounts to a couple of small
cocci scripts though so not a big deal really.

> 
> That's much easier to pull of by simply force-completing all
> i915_sw_fences before we take any modeset locks in the gpu reset path.
> Note that we don't need to force-complete any i915_gem_request, we can
> just force-complete the i915_sw_fences the work item is blocked on. Needs
> some care to avoid races with a new atomic commit (since we need to
> force-complete before we grab locks one might sneak in), but that's a
> standard pattern.
> 
> Plus we then need to wait for all outstanding nonblocking commits once we
> do have all modeset locks, since with atomic holding the locks only syncs
> against synchronous hw commits (i.e do a fake synchronous commit before we
> nuke the display and we're good). A variant of wait_for_dependencies that
> waits for all crtc instead of just the crtc in an atomic commit would do I
> think.
> 
> None of this requires any of the prep work we need the fancy additional
> atomic stuff you plan to do. Which I think is really good for a regression
> fix.
> 
> So again, why do we need to rewrite the world (since these patches here
> seem to just be the racy poc) to fix reset on gen3/4?
> 
> I know you want to do all this, but tangling up a regression fix in a
> rewrite isn't a good idea in my opinion. I'm not against your long-term
> plans, I just think it'd be good if we can have them as orthogonal pieces
> if feasible.

I've pretty much decided that I'll be happy if I can solve this for
future kernels. If someone else wants to try and cook up some kind
of simple regression fix I don't have any real objections.

-- 
Ville Syrj�l�
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-07-03  9:30                 ` Ville Syrjälä
  0 siblings, 0 replies; 44+ messages in thread
From: Ville Syrjälä @ 2017-07-03  9:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 09:55:48AM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
> > On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
> > > <ville.syrjala@linux.intel.com> wrote:
> > > >> And if the GEM folks insist the old behavior can't be restored, then we
> > > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > > >> depend upon is imo the simplest solution to this, and we've had a driver
> > > >> that worked like that for years.
> > > >
> > > > And it used to break all the time. I think we've had to fix it at least
> > > > three times by now. So I tend to think it's better to fix it in a way
> > > > that won't break so easily.
> > > 
> > > Why exactly is making the atomic code massive more tricky the easy
> > > fix?
> > 
> > I don't see what this massive trickyness is. Compared to the rest of
> > atomic what I have is absolutely trivial. Just the
> > duplicate_committed_state() and the '.committed_state = foo'
> > assignments in hw_done(). That's it really.
> 
> >From a quick look and your description it seems full of races.

This "simple" version has problems, yes. The full versions has just
the one potential race between swap_state() and hw_done(). That
seems like pretty easy to handle.

> I'm not
> sure it'll still be simple once those are fixed.
> 
> > > That's the part I don't get. Yes it got broken a bunch because no
> > > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > > reset, but in the end we do have a depency loop, and either the
> > > modeset side or the render side needs to bail out and cancel it's
> > > async stuff (whether that's a request or a nonblocking flip/atomic
> > > commit doesn't matter). In my opinion, cancelling the request (even if
> > > we're clever and only cancel the requests for the modeset waiters,
> > > which isn't to hard to pull off) seems about the simplest option.
> > > Especially since we need that code anyway, even TDR can't safe
> > > everything and resubmit under all circumstances (at least the buggy
> > > batch can't be resubmitted).
> > > 
> > > Cancelling any kind of atomic commit otoh looks like a lot more
> > > complexity.
> > 
> > I'm not cancelling anything.
> 
> Well by overtaking the in-flight commit you are at least fighting with
> that. Either you need to cancel that one, or insert the gpu reset commit
> at the right point (and with the right state). Current code drops that and
> instead seems to just hope it doesn't lead to tears too much.

The simple version is pretty much "cross our fingers and hope for
the best" type of thing, yes. The full version I cooked up earlier
doesn't rely on hope.

In fact I was able to come up with a testcase that does make the simple
version explode, whereas the full version works just fine. So we should
abandon the idea of using this simple version actually.

> 
> > > Why do you think this is the easier, or at least less
> > > fragile option? This patch series is full of FIXMEs, and even the more
> > > complete set seems to have a pile of holes. Plus we need to stop using
> > > obj->state, and I don't see any easy way to test for that (since the
> > > gen3/4 gpu reset case is the only corner cases that currently needs
> > > that).
> > 
> > We need to fix that stuff anyway if we ever want to queue up multiple
> > commits for the same crtc. The stuff I have that is specific to this
> > reset stuff is actually very simple. The rest is just fixing up the
> > huge mess we've already made.
> 
> Rewriting the world for a regression fix seems a bit much is all I'm
> saying. And I'm not sure your approach works without that "rewrite the
> world" step. Defacto what your current patches seem to result in is
> - we commit the final sw state in gpu reset
> - before we resubmit the rendering

That's true. And we do appear to need the refactoring to make it
actually work. Most of the refactoring amounts to a couple of small
cocci scripts though so not a big deal really.

> 
> That's much easier to pull of by simply force-completing all
> i915_sw_fences before we take any modeset locks in the gpu reset path.
> Note that we don't need to force-complete any i915_gem_request, we can
> just force-complete the i915_sw_fences the work item is blocked on. Needs
> some care to avoid races with a new atomic commit (since we need to
> force-complete before we grab locks one might sneak in), but that's a
> standard pattern.
> 
> Plus we then need to wait for all outstanding nonblocking commits once we
> do have all modeset locks, since with atomic holding the locks only syncs
> against synchronous hw commits (i.e do a fake synchronous commit before we
> nuke the display and we're good). A variant of wait_for_dependencies that
> waits for all crtc instead of just the crtc in an atomic commit would do I
> think.
> 
> None of this requires any of the prep work we need the fancy additional
> atomic stuff you plan to do. Which I think is really good for a regression
> fix.
> 
> So again, why do we need to rewrite the world (since these patches here
> seem to just be the racy poc) to fix reset on gen3/4?
> 
> I know you want to do all this, but tangling up a regression fix in a
> rewrite isn't a good idea in my opinion. I'm not against your long-term
> plans, I just think it'd be good if we can have them as orthogonal pieces
> if feasible.

I've pretty much decided that I'll be happy if I can solve this for
future kernels. If someone else wants to try and cook up some kind
of simple regression fix I don't have any real objections.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  8:16                   ` Chris Wilson
@ 2017-07-03 16:42                       ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:42 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Daniel Vetter, Ville Syrjälä, intel-gfx, dri-devel, stable

On Mon, Jul 03, 2017 at 09:16:54AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-07-03 09:03:36)
> > On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > >> Yeah, but my point is that this here is an extremely fancy and fragile
> > >> (and afaics in this form, incomplete) fix for something that in the past
> > >> was fixed much, much simpler. Why do we need this massive amount of
> > >> complexity now? Who's asking for all this (we don't even have anyone yet
> > >> asking for fully queued atomic commits, much less on gen4)?
> > >
> > > It was never "fixed", it was always borked; broken by design.
> > 
> > Hm, what was broken by design in gen3/4 reset? We never bothered to
> > resubmit rendering when the gpu died, but besides that I'm not aware
> > of a deisgn issue in that logic. We nuked in-flight pageflips (and
> > restored those), and we stalled for any pending modesets (grabbing
> > locks did that since all modesets where blocking), and that made sure
> > the hw was in a consistent state.
> 
> KMS reset was taking mutexes within reset without any means of breaking the
> inherent deadlock, instead relying on something else to magically fix
> it. We only ever engineered around struct_mutex for reset, the remains
> of the deadlock upon struct_mutex within modeset is an issue but not the
> one causing trouble here.

So on the kms side we've had:
- grab kms locks -> grab struct_mutex -> wait for rendering

- on the reset side we've had the same order afair, with the caveat that
  we've broken the "wait for rendering" complete before trying to grab any
  of the locks in the reset path.

I thought that the problem is that gpu reset stopping force-completing
everything asap, which then lead the kms side to time-out at a point where
it shouldn't and start to fall over.

So before

commit 221fe7994554cc3985fc5d761ed7e44dcae0fa52
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 9 14:11:51 2016 +0100

    drm/i915: Perform a direct reset of the GPU from the waiter

what was the deadlock we've had? Besides breaking the "wait for rendering"
I don't remember any inversions we've had. And for the breaking we've had
a fairly complex dance of barriers and reset_in_progress and waking up
waiters to make sure it would catch them all ...

> Please do note the bugs that indicate that even without modeset reset,
> hw is not in a consistent state.

So there's more bugs, not sure how that is relevant for gpu reset?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-07-03 16:42                       ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:42 UTC (permalink / raw)
  To: Chris Wilson; +Cc: dri-devel, stable, intel-gfx

On Mon, Jul 03, 2017 at 09:16:54AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-07-03 09:03:36)
> > On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > >> Yeah, but my point is that this here is an extremely fancy and fragile
> > >> (and afaics in this form, incomplete) fix for something that in the past
> > >> was fixed much, much simpler. Why do we need this massive amount of
> > >> complexity now? Who's asking for all this (we don't even have anyone yet
> > >> asking for fully queued atomic commits, much less on gen4)?
> > >
> > > It was never "fixed", it was always borked; broken by design.
> > 
> > Hm, what was broken by design in gen3/4 reset? We never bothered to
> > resubmit rendering when the gpu died, but besides that I'm not aware
> > of a deisgn issue in that logic. We nuked in-flight pageflips (and
> > restored those), and we stalled for any pending modesets (grabbing
> > locks did that since all modesets where blocking), and that made sure
> > the hw was in a consistent state.
> 
> KMS reset was taking mutexes within reset without any means of breaking the
> inherent deadlock, instead relying on something else to magically fix
> it. We only ever engineered around struct_mutex for reset, the remains
> of the deadlock upon struct_mutex within modeset is an issue but not the
> one causing trouble here.

So on the kms side we've had:
- grab kms locks -> grab struct_mutex -> wait for rendering

- on the reset side we've had the same order afair, with the caveat that
  we've broken the "wait for rendering" complete before trying to grab any
  of the locks in the reset path.

I thought that the problem is that gpu reset stopping force-completing
everything asap, which then lead the kms side to time-out at a point where
it shouldn't and start to fall over.

So before

commit 221fe7994554cc3985fc5d761ed7e44dcae0fa52
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 9 14:11:51 2016 +0100

    drm/i915: Perform a direct reset of the GPU from the waiter

what was the deadlock we've had? Besides breaking the "wait for rendering"
I don't remember any inversions we've had. And for the breaking we've had
a fairly complex dance of barriers and reset_in_progress and waking up
waiters to make sure it would catch them all ...

> Please do note the bugs that indicate that even without modeset reset,
> hw is not in a consistent state.

So there's more bugs, not sure how that is relevant for gpu reset?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  9:30                 ` Ville Syrjälä
@ 2017-07-03 16:48                   ` Daniel Vetter
  -1 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:48 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 12:30:48PM +0300, Ville Syrj�l� wrote:
> I've pretty much decided that I'll be happy if I can solve this for
> future kernels. If someone else wants to try and cook up some kind
> of simple regression fix I don't have any real objections.

Imo fixing the regression isn't the job of the display side, since display
isn't the one that broke the reset contract. Well apart from not syncing
with pending non-blocking atomic modeset commits, which we could fix
easily by dropping DRIVER_ATOMIC from gen2-4. But I'm still trying to
figure out why Chris thinks this isn't a gem regression ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
@ 2017-07-03 16:48                   ` Daniel Vetter
  0 siblings, 0 replies; 44+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:48 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 12:30:48PM +0300, Ville Syrjälä wrote:
> I've pretty much decided that I'll be happy if I can solve this for
> future kernels. If someone else wants to try and cook up some kind
> of simple regression fix I don't have any real objections.

Imo fixing the regression isn't the job of the display side, since display
isn't the one that broke the reset contract. Well apart from not syncing
with pending non-blocking atomic modeset commits, which we could fix
easily by dropping DRIVER_ATOMIC from gen2-4. But I'm still trying to
figure out why Chris thinks this isn't a gem regression ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2017-07-03 16:48 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-29 13:49 [PATCH 0/5] drm/i915: Fix pre-g4x GPU reset, again ville.syrjala
2017-06-29 13:49 ` [PATCH 1/5] drm/atomic: Refactor drm_atomic_state_realloc_connectors() ville.syrjala
2017-06-30 13:15   ` Daniel Vetter
2017-06-29 13:49 ` [PATCH 2/5] drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state() ville.syrjala
2017-06-29 13:49 ` [PATCH 3/5] drm/i915% Store vma gtt offset in plane state ville.syrjala
2017-06-29 13:49 ` [PATCH 4/5] drm/i915: Refactor __intel_atomic_commit_tail() ville.syrjala
2017-06-29 13:49 ` [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
2017-06-29 13:56   ` Chris Wilson
2017-06-29 14:36   ` [PATCH v4 " ville.syrjala
2017-06-29 14:36     ` ville.syrjala
2017-06-29 17:57     ` Chris Wilson
2017-06-29 19:26       ` Ville Syrjälä
2017-06-29 19:26         ` Ville Syrjälä
2017-06-30 13:35         ` Daniel Vetter
2017-06-30 13:35           ` Daniel Vetter
2017-06-30 13:53           ` Ville Syrjälä
2017-06-30 13:53             ` Ville Syrjälä
2017-06-30 15:30             ` Daniel Vetter
2017-06-30 15:30               ` Daniel Vetter
2017-06-30 15:39               ` Chris Wilson
2017-06-30 15:39                 ` Chris Wilson
2017-07-03  8:03                 ` Daniel Vetter
2017-07-03  8:03                   ` Daniel Vetter
2017-07-03  8:16                   ` Chris Wilson
2017-07-03 16:42                     ` Daniel Vetter
2017-07-03 16:42                       ` Daniel Vetter
2017-06-30 13:25   ` [PATCH " Daniel Vetter
2017-06-30 13:25     ` Daniel Vetter
2017-06-30 13:30     ` Daniel Vetter
2017-06-30 13:30       ` Daniel Vetter
2017-06-30 15:44       ` Ville Syrjälä
2017-06-30 15:44         ` Ville Syrjälä
2017-06-30 18:23         ` Daniel Vetter
2017-06-30 18:46           ` Ville Syrjälä
2017-06-30 18:46             ` Ville Syrjälä
2017-07-03  7:55             ` Daniel Vetter
2017-07-03  7:55               ` Daniel Vetter
2017-07-03  9:30               ` Ville Syrjälä
2017-07-03  9:30                 ` Ville Syrjälä
2017-07-03 16:48                 ` Daniel Vetter
2017-07-03 16:48                   ` Daniel Vetter
2017-06-29 14:24 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again Patchwork
2017-06-29 15:07   ` Ville Syrjälä
2017-06-29 14:55 ` ✓ Fi.CI.BAT: success for drm/i915: Fix pre-g4x GPU reset, again (rev2) Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.