[PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

* [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
@ 2018-04-27 19:32 Chris Wilson
  2018-04-27 20:12 ` Michel Thierry
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Chris Wilson @ 2018-04-27 19:32 UTC (permalink / raw)
  To: intel-gfx

Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..422b05290ed6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
 	unsigned long flags;
+	u32 *regs;
 
 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs,
+			       defaults + LRC_HEADER_PAGES * PAGE_SIZE,
+			       engine->context_size);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
 
 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;
 
 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);
-- 
2.17.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
@ 2018-04-27 20:12 ` Michel Thierry
  2018-04-27 20:20   ` Chris Wilson
  2018-04-27 20:24 ` [PATCH v2] " Chris Wilson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Michel Thierry @ 2018-04-27 20:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 4/27/2018 12:32 PM, Chris Wilson wrote:
> Previously, we just reset the ring register in the context image such
> that we could skip over the broken batch and emit the closing
> breadcrumb. However, on resume the context image and GPU state would be
> reloaded, which may have been left in an inconsistent state by the
> reset. The presumption was that at worst it would just cause another
> reset and skip again until it recovered, however it seems just as likely
> to cause an unrecoverable hang. Instead of risking loading an incomplete
> context image, restore it back to the default state.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ce23d5116482..422b05290ed6 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   			      struct i915_request *request)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> -	struct intel_context *ce;
>   	unsigned long flags;
> +	u32 *regs;
>   
>   	GEM_TRACE("%s request global=%x, current=%d\n",
>   		  engine->name, request ? request->global_seqno : 0,
> @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   	 * future request will be after userspace has had the opportunity
>   	 * to recreate its own state.
>   	 */
> -	ce = &request->ctx->engine[engine->id];
> -	execlists_init_reg_state(ce->lrc_reg_state,
> -				 request->ctx, engine, ce->ring);
> +	regs = request->ctx->engine[engine->id].lrc_reg_state;
> +	if (engine->default_state) {
> +		void *defaults;
> +
> +		defaults = i915_gem_object_pin_map(engine->default_state,
> +						   I915_MAP_WB);
> +		if (!IS_ERR(defaults)) {
> +			memcpy(regs,
> +			       defaults + LRC_HEADER_PAGES * PAGE_SIZE,
> +			       engine->context_size);
Hi,

The context_size takes the PP_HWSP page into account; do we also need
to rewrite the PP_HWSP, or just the logical state?

Also regs is already pointing to the start of the logical state
(vaddr + LRC_STATE_PN * PAGE_SIZE).

So if we want to overwrite from the PP_HWSP, then regs is not the right 
offset, or if we only want to change the logical state then it should be 
from 'defaults +  LRC_STATE_PN * PAGE_SIZE'.

-Michel

> +			i915_gem_object_unpin_map(engine->default_state);
> +		}
> +	}
> +	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
>   
>   	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
> -	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
> -		i915_ggtt_offset(ce->ring->vma);
> -	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
> +	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
> +	regs[CTX_RING_HEAD + 1] = request->postfix;
>   
>   	request->ring->head = request->postfix;
>   	intel_ring_update_space(request->ring);
> 

* Re: [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 20:12 ` Michel Thierry
@ 2018-04-27 20:20   ` Chris Wilson
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Wilson @ 2018-04-27 20:20 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx; +Cc: Mika

Quoting Michel Thierry (2018-04-27 21:12:38)
> On 4/27/2018 12:32 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
> >   1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index ce23d5116482..422b05290ed6 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
> >                             struct i915_request *request)
> >   {
> >       struct intel_engine_execlists * const execlists = &engine->execlists;
> > -     struct intel_context *ce;
> >       unsigned long flags;
> > +     u32 *regs;
> >   
> >       GEM_TRACE("%s request global=%x, current=%d\n",
> >                 engine->name, request ? request->global_seqno : 0,
> > @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
> >        * future request will be after userspace has had the opportunity
> >        * to recreate its own state.
> >        */
> > -     ce = &request->ctx->engine[engine->id];
> > -     execlists_init_reg_state(ce->lrc_reg_state,
> > -                              request->ctx, engine, ce->ring);
> > +     regs = request->ctx->engine[engine->id].lrc_reg_state;
> > +     if (engine->default_state) {
> > +             void *defaults;
> > +
> > +             defaults = i915_gem_object_pin_map(engine->default_state,
> > +                                                I915_MAP_WB);
> > +             if (!IS_ERR(defaults)) {
> > +                     memcpy(regs,
> > +                            defaults + LRC_HEADER_PAGES * PAGE_SIZE,
> > +                            engine->context_size);
> Hi,
> 
> The context_size takes the PP_HWSP page into account; do we also need
> to rewrite the PP_HWSP, or just the logical state?
> 
> Also regs is already pointing to the start of the logical state
> (vaddr + LRC_STATE_PN * PAGE_SIZE).

Yeah, I was aiming for just the register state, and had a nice little
off-by-one in comparing the macros.
 
> So if we want to overwrite from the PP_HWSP, then regs is not the right 
> offset, or if we only want to change the logical state then it should be 
> from 'defaults +  LRC_STATE_PN * PAGE_SIZE'.

Right, I don't think we need to scrub the HWSP, just the register state.
The context is lost at this point, and what I want to protect is the
read of the image following the reset. Afaik, we don't issue any reads
from PPHWSP.
-Chris
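
The offset arithmetic discussed above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: the SKETCH_* names and
page indices are hypothetical stand-ins rather than the driver's macros, both
images are assumed to share one layout (one header page, then the PPHWSP page,
then the register state), and context_size is assumed to count the PPHWSP page
plus the register state pages, as Michel points out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative values only; the real driver macros may differ. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_HEADER_PAGES	1u	/* pages that precede the PPHWSP */
#define SKETCH_STATE_PN		2u	/* page index of the register state */

/*
 * Copy only the register state from the default image, leaving the header
 * and PPHWSP pages of the context image untouched. Source and destination
 * both start at the register state page, and the length drops the one
 * PPHWSP page that context_size is assumed to include.
 */
static void sketch_scrub_reg_state(unsigned char *image,
				   const unsigned char *defaults,
				   size_t context_size)
{
	unsigned char *regs = image + SKETCH_STATE_PN * SKETCH_PAGE_SIZE;

	memcpy(regs,
	       defaults + SKETCH_STATE_PN * SKETCH_PAGE_SIZE,
	       context_size - SKETCH_PAGE_SIZE);
}

/* Returns 0 if only the register state pages were overwritten. */
static int sketch_selftest(void)
{
	enum { STATE_PAGES = 2 };
	const size_t context_size = (1 + STATE_PAGES) * SKETCH_PAGE_SIZE;
	const size_t total = SKETCH_HEADER_PAGES * SKETCH_PAGE_SIZE + context_size;
	static unsigned char image[4 * SKETCH_PAGE_SIZE];
	static unsigned char defaults[4 * SKETCH_PAGE_SIZE];
	size_t i;

	memset(image, 0xaa, total);	/* stale state left behind by the hang */
	memset(defaults, 0x00, total);
	memset(defaults + SKETCH_HEADER_PAGES * SKETCH_PAGE_SIZE, 0x11,
	       SKETCH_PAGE_SIZE);	/* default PPHWSP contents */
	memset(defaults + SKETCH_STATE_PN * SKETCH_PAGE_SIZE, 0x22,
	       STATE_PAGES * SKETCH_PAGE_SIZE);	/* default register state */

	sketch_scrub_reg_state(image, defaults, context_size);

	/* The header and PPHWSP pages must still hold the old contents. */
	for (i = 0; i < SKETCH_STATE_PN * SKETCH_PAGE_SIZE; i++)
		if (image[i] != 0xaa)
			return -1;

	/* The register state must now match the defaults. */
	return memcmp(image + SKETCH_STATE_PN * SKETCH_PAGE_SIZE,
		      defaults + SKETCH_STATE_PN * SKETCH_PAGE_SIZE,
		      STATE_PAGES * SKETCH_PAGE_SIZE) ? -1 : 0;
}
```

Under the same assumptions, starting the source one page earlier and copying
the full context_size, as the first version of the patch did, would land the
default PPHWSP contents on top of the register state and write one page past
the end of the image.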

* [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
  2018-04-27 20:12 ` Michel Thierry
@ 2018-04-27 20:24 ` Chris Wilson
  2018-04-27 20:27   ` Michel Thierry
  2018-04-27 20:29 ` [PATCH v3] " Chris Wilson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Chris Wilson @ 2018-04-27 20:24 UTC (permalink / raw)
  To: intel-gfx

Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the ppHWSP in with the register
state.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..01750a4c2f3f 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
 	unsigned long flags;
+	u32 *regs;
 
 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
+			       defaults + LRC_STATE_PN * PAGE_SIZE,
+			       engine->context_size - PAGE_SIZE);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
 
 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;
 
 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);
-- 
2.17.0


* Re: [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 20:24 ` [PATCH v2] " Chris Wilson
@ 2018-04-27 20:27   ` Michel Thierry
  2018-04-27 20:35     ` Chris Wilson
  0 siblings, 1 reply; 17+ messages in thread
From: Michel Thierry @ 2018-04-27 20:27 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 4/27/2018 1:24 PM, Chris Wilson wrote:
> Previously, we just reset the ring register in the context image such
> that we could skip over the broken batch and emit the closing
> breadcrumb. However, on resume the context image and GPU state would be
> reloaded, which may have been left in an inconsistent state by the
> reset. The presumption was that at worst it would just cause another
> reset and skip again until it recovered, however it seems just as likely
> to cause an unrecoverable hang. Instead of risking loading an incomplete
> context image, restore it back to the default state.
> 
> v2: Fix up off-by-one from including the ppHWSP in with the register
> state.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

Does it need a 'Fixes:' tag or have a bugzilla reference?
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ce23d5116482..01750a4c2f3f 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   			      struct i915_request *request)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> -	struct intel_context *ce;
>   	unsigned long flags;
> +	u32 *regs;
>   
>   	GEM_TRACE("%s request global=%x, current=%d\n",
>   		  engine->name, request ? request->global_seqno : 0,
> @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   	 * future request will be after userspace has had the opportunity
>   	 * to recreate its own state.
>   	 */
> -	ce = &request->ctx->engine[engine->id];
> -	execlists_init_reg_state(ce->lrc_reg_state,
> -				 request->ctx, engine, ce->ring);
> +	regs = request->ctx->engine[engine->id].lrc_reg_state;
> +	if (engine->default_state) {
> +		void *defaults;
> +
> +		defaults = i915_gem_object_pin_map(engine->default_state,
> +						   I915_MAP_WB);
> +		if (!IS_ERR(defaults)) {
> +			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
> +			       defaults + LRC_STATE_PN * PAGE_SIZE,
> +			       engine->context_size - PAGE_SIZE);
> +			i915_gem_object_unpin_map(engine->default_state);
> +		}
> +	}
> +	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
>   
>   	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
> -	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
> -		i915_ggtt_offset(ce->ring->vma);
> -	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
> +	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
> +	regs[CTX_RING_HEAD + 1] = request->postfix;
>   
>   	request->ring->head = request->postfix;
>   	intel_ring_update_space(request->ring);
> 

* [PATCH v3] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
  2018-04-27 20:12 ` Michel Thierry
  2018-04-27 20:24 ` [PATCH v2] " Chris Wilson
@ 2018-04-27 20:29 ` Chris Wilson
  2018-04-28  7:56 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Chris Wilson @ 2018-04-27 20:29 UTC (permalink / raw)
  To: intel-gfx

Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the ppHWSP in with the register
state.
v3: Use a ring local to compact a few lines.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..bbca79bf19cc 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,9 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
+	struct intel_ring * const ring = request->ring;
 	unsigned long flags;
+	u32 *regs;
 
 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,17 +1856,27 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
+			       defaults + LRC_STATE_PN * PAGE_SIZE,
+			       engine->context_size - PAGE_SIZE);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, ring);
 
 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;
 
-	request->ring->head = request->postfix;
-	intel_ring_update_space(request->ring);
+	ring->head = request->postfix;
+	intel_ring_update_space(ring);
 
 	/* Reset WaIdleLiteRestore:bdw,skl as well */
 	unwind_wa_tail(request);
-- 
2.17.0


* Re: [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 20:27   ` Michel Thierry
@ 2018-04-27 20:35     ` Chris Wilson
  2018-04-27 22:30       ` Michel Thierry
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Wilson @ 2018-04-27 20:35 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx; +Cc: Mika

Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > v2: Fix up off-by-one from including the ppHWSP in with the register
> > state.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
> 
> Does it need a 'Fixes:' tag or have a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways how the reset might appear to work but the
recovery fail with 

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb 	Requests:
<7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
<7>[  521.765645] missed_breadcrumb 		HW active? 0x1
<7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note are the IPEHR being MI_LRI, the ring being idle (it
hasn't moved on from the earlier reset) and the fetch address being
unconnected to the rings, so naturally I assume it died loading the
context image on resume.
-Chris
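
As a rough illustration of how the IPEHR value in the log above reads as
MI_LRI, here is a small decoding sketch. It assumes the usual MI command
header layout (client in bits 31:29, opcode in bits 28:23, dword length in the
low byte) and an MI_LOAD_REGISTER_IMM opcode of 0x22 with a length field of
2*n - 1 for n register/value pairs. The sketch_* names are hypothetical and
the field layout should be checked against the hardware documentation.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MI_CLIENT	0x0u	/* assumed client value for MI commands */
#define SKETCH_MI_LRI_OPCODE	0x22u	/* assumed MI_LOAD_REGISTER_IMM opcode */

/* Command client, taken from bits 31:29 of the header dword. */
static unsigned int sketch_client(uint32_t header)
{
	return header >> 29;
}

/* MI opcode, taken from bits 28:23 of the header dword. */
static unsigned int sketch_mi_opcode(uint32_t header)
{
	return (header >> 23) & 0x3f;
}

/* True if the header looks like an MI_LOAD_REGISTER_IMM. */
static int sketch_is_mi_lri(uint32_t header)
{
	return sketch_client(header) == SKETCH_MI_CLIENT &&
	       sketch_mi_opcode(header) == SKETCH_MI_LRI_OPCODE;
}

/* Number of register/value pairs, given a length field of 2*n - 1. */
static unsigned int sketch_lri_count(uint32_t header)
{
	return ((header & 0xff) + 1) / 2;
}
```

With these assumptions, 0x11000011 decodes as opcode 0x22 with a length field
of 0x11, that is, a load of nine register/value pairs.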

* Re: [PATCH v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 20:35     ` Chris Wilson
@ 2018-04-27 22:30       ` Michel Thierry
  0 siblings, 0 replies; 17+ messages in thread
From: Michel Thierry @ 2018-04-27 22:30 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx
  Cc: Mika Kuoppala, Michał Winiarski, Tvrtko Ursulin

On 4/27/2018 1:35 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-27 21:27:46)
>> On 4/27/2018 1:24 PM, Chris Wilson wrote:
>>> Previously, we just reset the ring register in the context image such
>>> that we could skip over the broken batch and emit the closing
>>> breadcrumb. However, on resume the context image and GPU state would be
>>> reloaded, which may have been left in an inconsistent state by the
>>> reset. The presumption was that at worst it would just cause another
>>> reset and skip again until it recovered, however it seems just as likely
>>> to cause an unrecoverable hang. Instead of risking loading an incomplete
>>> context image, restore it back to the default state.
>>>
>>> v2: Fix up off-by-one from including the ppHWSP in with the register
>>> state.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
>>
>> Does it need a 'Fixes:' tag or have a bugzilla reference?
> 
> I suspect it's rare enough that the unrecoverable hang might not be
> recognisable in bugzilla. I was just looking at
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
> 
> trying to think of ways how the reset might appear to work but the
> recovery fail with
> 
> <7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
> <7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
> <7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
> <7>[  521.765206] missed_breadcrumb 	Requests:
> <7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
> <7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
> <7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
> <7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
> <7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
> <7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
> <7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
> <7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
> <7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
> <7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
> <7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
> <7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
> <7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
> <7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
> <7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
> <7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
> <7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
> <7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
> <7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
> <7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
> <7>[  521.765645] missed_breadcrumb 		HW active? 0x1
> <7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
> <7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
> <7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
> <7>[  521.765707] missed_breadcrumb HWSP:
> <7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765733] missed_breadcrumb *
> <7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
> <7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
> <7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765784] missed_breadcrumb *
> <7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765833] missed_breadcrumb *
> <7>[  521.765845] missed_breadcrumb Idle? no
> 
> Of particular note: the IPEHR is MI_LRI, the ring is idle (it hasn't
> moved on from the earlier reset), and the fetch address is unconnected
> to the rings, so naturally I assume it died loading the context image
> on resume.
Plus it is a bsw...
Agreed, this looks like an issue during the ctx restore.
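
For anyone following along, the IPEHR value can be checked by hand. A
minimal sketch, assuming the MI_INSTR() layout from the i915 headers
(client in bits 31:29, opcode in bits 28:23, and MI_LOAD_REGISTER_IMM
being opcode 0x22 with a length field of 2*n - 1). The sketch_* names
are made up for illustration and are not driver functions:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MI_CLIENT       0x0u   /* bits 31:29 are zero for MI commands */
#define SKETCH_MI_OPCODE_SHIFT 23
#define SKETCH_MI_OPCODE_MASK  0x3fu
#define SKETCH_MI_LRI_OPCODE   0x22u  /* MI_LOAD_REGISTER_IMM */

/* Command client field, bits 31:29 of the header dword. */
static uint32_t sketch_cmd_client(uint32_t header)
{
	return header >> 29;
}

/* MI opcode field, bits 28:23 of the header dword. */
static uint32_t sketch_mi_opcode(uint32_t header)
{
	return (header >> SKETCH_MI_OPCODE_SHIFT) & SKETCH_MI_OPCODE_MASK;
}

/*
 * Number of (register, value) pairs carried by an LRI header.
 * The low byte holds 2*n - 1, so n = (len + 1) / 2.
 */
static uint32_t sketch_lri_count(uint32_t header)
{
	assert(sketch_mi_opcode(header) == SKETCH_MI_LRI_OPCODE);
	return ((header & 0xffu) + 1) / 2;
}
```

Feeding in the 0x11000011 from the dump gives client 0 (MI), opcode 0x22
and 9 register/value pairs, which matches reading it as an LRI.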

> -Chris
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* ✗ Fi.CI.BAT: failure for drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (2 preceding siblings ...)
  2018-04-27 20:29 ` [PATCH v3] " Chris Wilson
@ 2018-04-28  7:56 ` Patchwork
  2018-04-28  9:17 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3) Patchwork
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Patchwork @ 2018-04-28  7:56 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request
URL   : https://patchwork.freedesktop.org/series/42425/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4108 -> Patchwork_8829 =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_8829 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_8829, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_8829:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_busy@basic-hang-default:
      fi-cfl-8700k:       PASS -> INCOMPLETE
      fi-kbl-7560u:       PASS -> INCOMPLETE
      fi-bsw-n3050:       PASS -> INCOMPLETE
      fi-cfl-u:           PASS -> INCOMPLETE
      fi-cfl-s3:          PASS -> INCOMPLETE
      fi-bdw-5557u:       PASS -> INCOMPLETE
      fi-kbl-7500u:       PASS -> INCOMPLETE
      fi-kbl-7567u:       PASS -> INCOMPLETE
      fi-kbl-r:           PASS -> INCOMPLETE

    
== Known issues ==

  Here are the changes found in Patchwork_8829 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_busy@basic-hang-default:
      fi-skl-6600u:       PASS -> INCOMPLETE (fdo#104108)
      fi-cnl-y3:          PASS -> INCOMPLETE (fdo#105086)
      fi-skl-6700k2:      PASS -> INCOMPLETE (fdo#104108)
      fi-cnl-psr:         PASS -> INCOMPLETE (fdo#105086)
      fi-skl-6770hq:      PASS -> INCOMPLETE (fdo#104108)
      fi-skl-gvtdvm:      PASS -> INCOMPLETE (fdo#104108, fdo#105600)
      fi-skl-6260u:       PASS -> INCOMPLETE (fdo#104108)
      fi-bxt-j4205:       PASS -> INCOMPLETE (fdo#103927)
      fi-bdw-gvtdvm:      PASS -> INCOMPLETE (fdo#105600)
      fi-glk-j4005:       PASS -> INCOMPLETE (k.org#198133, fdo#103359)
      fi-skl-guc:         PASS -> INCOMPLETE (fdo#104108)
      fi-bxt-dsi:         PASS -> INCOMPLETE (fdo#103927)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c:
      fi-ivb-3520m:       PASS -> DMESG-WARN (fdo#106084)

    
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103927 https://bugs.freedesktop.org/show_bug.cgi?id=103927
  fdo#104108 https://bugs.freedesktop.org/show_bug.cgi?id=104108
  fdo#105086 https://bugs.freedesktop.org/show_bug.cgi?id=105086
  fdo#105600 https://bugs.freedesktop.org/show_bug.cgi?id=105600
  fdo#106084 https://bugs.freedesktop.org/show_bug.cgi?id=106084
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (39 -> 36) ==

  Missing    (3): fi-ctg-p8600 fi-ilk-m540 fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4108 -> Patchwork_8829

  CI_DRM_4108: 6270f64d10649baff02ae464542f185476a6f652 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4450: 0350f0e7f6a0e07281445fc3082aa70419f4aac7 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8829: 58ad49346ab235d1b1f8b0cfcc7c266ee15aae7d @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4450: b57600ba58ae0cdbad86826fd653aa0191212f27 @ git://anongit.freedesktop.org/piglit


== Linux commits ==

58ad49346ab2 drm/i915/lrc: Scrub the GPU state of the guilty hanging request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8829/issues.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3)
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (3 preceding siblings ...)
  2018-04-28  7:56 ` ✗ Fi.CI.BAT: failure for " Patchwork
@ 2018-04-28  9:17 ` Patchwork
  2018-04-28 10:57 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Patchwork @ 2018-04-28  9:17 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3)
URL   : https://patchwork.freedesktop.org/series/42425/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4109 -> Patchwork_8830 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/3/mbox/

== Known issues ==

  Here are the changes found in Patchwork_8830 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_exec_suspend@basic-s4-devices:
      fi-kbl-7500u:       PASS -> DMESG-WARN (fdo#105128)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c:
      fi-ivb-3520m:       PASS -> DMESG-WARN (fdo#106084)

    igt@prime_vgem@basic-fence-flip:
      fi-ilk-650:         PASS -> FAIL (fdo#104008)

    
    ==== Possible fixes ====

    igt@gem_exec_suspend@basic-s3:
      fi-ivb-3520m:       DMESG-WARN (fdo#106084) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS

    
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#105128 https://bugs.freedesktop.org/show_bug.cgi?id=105128
  fdo#106084 https://bugs.freedesktop.org/show_bug.cgi?id=106084


== Participating hosts (38 -> 36) ==

  Additional (1): fi-cnl-y3 
  Missing    (3): fi-ctg-p8600 fi-ilk-m540 fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4109 -> Patchwork_8830

  CI_DRM_4109: e701a0e6315dc85615f83b2ee14d9cb2f425d97d @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4451: 29ae12bd764e3b1876356e7628a32192b4ec9066 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8830: 980e46b5e0f962f8bf3a20fbf8dc38d92e733c5c @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4451: b57600ba58ae0cdbad86826fd653aa0191212f27 @ git://anongit.freedesktop.org/piglit


== Linux commits ==

980e46b5e0f9 drm/i915/lrc: Scrub the GPU state of the guilty hanging request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8830/issues.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* ✗ Fi.CI.IGT: failure for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3)
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (4 preceding siblings ...)
  2018-04-28  9:17 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3) Patchwork
@ 2018-04-28 10:57 ` Patchwork
  2018-04-28 11:15 ` [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Patchwork @ 2018-04-28 10:57 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3)
URL   : https://patchwork.freedesktop.org/series/42425/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4109_full -> Patchwork_8830_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_8830_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_8830_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/3/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_8830_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@drv_selftest@live_hangcheck:
      shard-kbl:          PASS -> DMESG-FAIL
      shard-apl:          PASS -> DMESG-FAIL
      shard-glk:          PASS -> DMESG-FAIL

    igt@kms_vblank@pipe-b-wait-forked-busy-hang:
      shard-glk:          PASS -> INCOMPLETE +14

    
    ==== Warnings ====

    igt@drv_selftest@live_execlists:
      shard-glk:          PASS -> SKIP +1
      shard-apl:          PASS -> SKIP +1

    igt@drv_selftest@live_guc:
      shard-kbl:          PASS -> SKIP +1

    
== Known issues ==

  Here are the changes found in Patchwork_8830_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_eio@in-flight-internal-1us:
      shard-glk:          PASS -> INCOMPLETE (k.org#198133, fdo#103359) +11

    igt@gem_eio@in-flight-suspend:
      shard-kbl:          PASS -> INCOMPLETE (fdo#103665) +27

    igt@kms_cursor_legacy@flip-vs-cursor-atomic:
      shard-hsw:          PASS -> FAIL (fdo#102670)

    igt@kms_flip@2x-dpms-vs-vblank-race-interruptible:
      shard-hsw:          PASS -> FAIL (fdo#103060)

    igt@kms_vblank@pipe-a-query-forked-busy-hang:
      shard-apl:          PASS -> INCOMPLETE (fdo#103927) +27

    
    ==== Possible fixes ====

    igt@kms_flip@2x-flip-vs-expired-vblank:
      shard-hsw:          FAIL (fdo#102887) -> PASS

    igt@kms_flip@plain-flip-fb-recreate:
      shard-hsw:          FAIL (fdo#100368) -> PASS

    igt@kms_flip@wf_vblank-ts-check-interruptible:
      shard-hsw:          FAIL (fdo#103928) -> PASS

    igt@kms_universal_plane@disable-primary-vs-flip-pipe-c:
      shard-kbl:          DMESG-WARN (fdo#103558, fdo#105602) -> PASS +10

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102670 https://bugs.freedesktop.org/show_bug.cgi?id=102670
  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103558 https://bugs.freedesktop.org/show_bug.cgi?id=103558
  fdo#103665 https://bugs.freedesktop.org/show_bug.cgi?id=103665
  fdo#103927 https://bugs.freedesktop.org/show_bug.cgi?id=103927
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (9 -> 8) ==

  Missing    (1): shard-glkb 


== Build changes ==

    * Linux: CI_DRM_4109 -> Patchwork_8830

  CI_DRM_4109: e701a0e6315dc85615f83b2ee14d9cb2f425d97d @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4451: 29ae12bd764e3b1876356e7628a32192b4ec9066 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8830: 980e46b5e0f962f8bf3a20fbf8dc38d92e733c5c @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4451: b57600ba58ae0cdbad86826fd653aa0191212f27 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8830/shards.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (5 preceding siblings ...)
  2018-04-28 10:57 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-04-28 11:15 ` Chris Wilson
  2018-04-30 15:49   ` Michel Thierry
  2018-04-30 10:00 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4) Patchwork
  2018-04-30 12:38 ` ✓ Fi.CI.IGT: " Patchwork
  8 siblings, 1 reply; 17+ messages in thread
From: Chris Wilson @ 2018-04-28 11:15 UTC (permalink / raw)
  To: intel-gfx

Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the PPHWSP in with the register
state.
v3: Use a ring local to compact a few lines.
v4: Beware setting the ring local before checking for a NULL request.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Michel Thierry <michel.thierry@intel.com> #v2
---
 drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..513aee6b3634 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
 	unsigned long flags;
+	u32 *regs;
 
 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs, /* skip restoring the vanilla PPHWSP */
+			       defaults + LRC_STATE_PN * PAGE_SIZE,
+			       engine->context_size - PAGE_SIZE);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
 
 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;
 
 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);
-- 
2.17.0
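
The scrub step in the diff above can be tried outside the driver. This
is only a sketch under simplifying assumptions: the image is modelled
as one PPHWSP page followed by the register state, and
SKETCH_STATE_PN, SKETCH_RING_HEAD_DW and sketch_scrub() are
hypothetical stand-ins, not the values or helpers used by intel_lrc.c:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_STATE_PN     1u  /* hypothetical: page 0 holds the PPHWSP */
#define SKETCH_RING_HEAD_DW 5u  /* hypothetical dword slot for RING_HEAD */

/*
 * Overwrite the register state of a possibly inconsistent context image
 * with the default state, leaving the PPHWSP page of the live image
 * untouched, then point RING_HEAD past the hanging batch.
 *
 * regs points at the first dword of the register state in the live
 * image; defaults points at the start of the default image (PPHWSP
 * page included); context_size counts both parts.
 */
static void sketch_scrub(uint32_t *regs, const uint8_t *defaults,
			 size_t context_size, uint32_t postfix)
{
	assert(context_size > SKETCH_PAGE_SIZE);

	/* Skip the PPHWSP page in the defaults, copy everything after it. */
	memcpy(regs, defaults + SKETCH_STATE_PN * SKETCH_PAGE_SIZE,
	       context_size - SKETCH_PAGE_SIZE);

	/* Re-apply the per-request value on top of the defaults. */
	regs[SKETCH_RING_HEAD_DW] = postfix;
}
```

The point of the offset and the reduced length is the v2 note: copying
context_size bytes from the start of the defaults would both clobber
the wrong page and run one page past the end of the register state.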


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (6 preceding siblings ...)
  2018-04-28 11:15 ` [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
@ 2018-04-30 10:00 ` Patchwork
  2018-04-30 12:38 ` ✓ Fi.CI.IGT: " Patchwork
  8 siblings, 0 replies; 17+ messages in thread
From: Patchwork @ 2018-04-30 10:00 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
URL   : https://patchwork.freedesktop.org/series/42425/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4112 -> Patchwork_8835 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/4/mbox/

== Known issues ==

  Here are the changes found in Patchwork_8835 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_chamelium@hdmi-hpd-fast:
      fi-kbl-7500u:       SKIP -> FAIL (fdo#103841, fdo#102672)

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-cnl-y3:          PASS -> DMESG-WARN (fdo#104951)

    
    ==== Possible fixes ====

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-ivb-3520m:       DMESG-WARN (fdo#106084) -> PASS

    
  fdo#102672 https://bugs.freedesktop.org/show_bug.cgi?id=102672
  fdo#103841 https://bugs.freedesktop.org/show_bug.cgi?id=103841
  fdo#104951 https://bugs.freedesktop.org/show_bug.cgi?id=104951
  fdo#106084 https://bugs.freedesktop.org/show_bug.cgi?id=106084


== Participating hosts (39 -> 36) ==

  Missing    (3): fi-ctg-p8600 fi-ilk-m540 fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4112 -> Patchwork_8835

  CI_DRM_4112: 423a00794c9d9610a71d8a02cd3bc17c6fe5fae1 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4452: 29ae12bd764e3b1876356e7628a32192b4ec9066 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8835: 69c99005f772c00407dcb31080b92700184275ca @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4452: 04a2952c5b3782eb03cb136bb16d89daaf243f14 @ git://anongit.freedesktop.org/piglit


== Linux commits ==

69c99005f772 drm/i915/lrc: Scrub the GPU state of the guilty hanging request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8835/issues.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* ✓ Fi.CI.IGT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
  2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
                   ` (7 preceding siblings ...)
  2018-04-30 10:00 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4) Patchwork
@ 2018-04-30 12:38 ` Patchwork
  2018-04-30 13:06   ` Chris Wilson
  8 siblings, 1 reply; 17+ messages in thread
From: Patchwork @ 2018-04-30 12:38 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
URL   : https://patchwork.freedesktop.org/series/42425/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4112_full -> Patchwork_8835_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_8835_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_8835_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/4/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_8835_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_schedule@deep-bsd1:
      shard-kbl:          PASS -> SKIP

    igt@gem_mocs_settings@mocs-rc6-blt:
      shard-kbl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_8835_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@2x-plain-flip-ts-check:
      shard-hsw:          PASS -> FAIL (fdo#100368) +1

    igt@kms_flip@absolute-wf_vblank-interruptible:
      shard-glk:          PASS -> FAIL (fdo#106087)

    igt@kms_flip@blocking-absolute-wf_vblank-interruptible:
      shard-glk:          PASS -> FAIL (fdo#106134)

    igt@kms_flip@dpms-vs-vblank-race-interruptible:
      shard-apl:          PASS -> FAIL (fdo#103060)

    igt@kms_flip@flip-vs-expired-vblank-interruptible:
      shard-hsw:          PASS -> FAIL (fdo#105707)

    igt@kms_flip@plain-flip-ts-check-interruptible:
      shard-glk:          PASS -> FAIL (fdo#100368)

    igt@kms_flip@wf_vblank-ts-check-interruptible:
      shard-apl:          PASS -> FAIL (fdo#103933, fdo#105312)

    
    ==== Possible fixes ====

    igt@kms_flip@2x-flip-vs-absolute-wf_vblank-interruptible:
      shard-hsw:          FAIL (fdo#103928) -> PASS

    igt@kms_flip@flip-vs-expired-vblank-interruptible:
      shard-glk:          FAIL (fdo#102887, fdo#105363) -> PASS

    igt@kms_flip@plain-flip-fb-recreate-interruptible:
      shard-glk:          FAIL (fdo#100368) -> PASS

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#103933 https://bugs.freedesktop.org/show_bug.cgi?id=103933
  fdo#105312 https://bugs.freedesktop.org/show_bug.cgi?id=105312
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105707 https://bugs.freedesktop.org/show_bug.cgi?id=105707
  fdo#106087 https://bugs.freedesktop.org/show_bug.cgi?id=106087
  fdo#106134 https://bugs.freedesktop.org/show_bug.cgi?id=106134


== Participating hosts (9 -> 8) ==

  Missing    (1): shard-glkb 


== Build changes ==

    * Linux: CI_DRM_4112 -> Patchwork_8835

  CI_DRM_4112: 423a00794c9d9610a71d8a02cd3bc17c6fe5fae1 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4452: 29ae12bd764e3b1876356e7628a32192b4ec9066 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_8835: 69c99005f772c00407dcb31080b92700184275ca @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4452: 04a2952c5b3782eb03cb136bb16d89daaf243f14 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8835/shards.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ✓ Fi.CI.IGT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
  2018-04-30 12:38 ` ✓ Fi.CI.IGT: " Patchwork
@ 2018-04-30 13:06   ` Chris Wilson
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Wilson @ 2018-04-30 13:06 UTC (permalink / raw)
  To: Patchwork; +Cc: intel-gfx

Quoting Patchwork (2018-04-30 13:38:55)
> == Series Details ==
> 
> Series: drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4)
> URL   : https://patchwork.freedesktop.org/series/42425/
> State : success
> 
> == Summary ==
> 
> = CI Bug Log - changes from CI_DRM_4112_full -> Patchwork_8835_full =
> 
> == Summary - WARNING ==
> 
>   Minor unknown changes coming with Patchwork_8835_full need to be verified
>   manually.
>   
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_8835_full, please notify your bug team to allow them
>   to document this new failure mode, which will reduce false positives in CI.
> 
>   External URL: https://patchwork.freedesktop.org/api/1.0/series/42425/revisions/4/mbox/
> 
> == Possible new issues ==
> 
>   Here are the unknown changes that may have been introduced in Patchwork_8835_full:
> 
>   === IGT changes ===
> 
>     ==== Warnings ====
> 
>     igt@gem_exec_schedule@deep-bsd1:
>       shard-kbl:          PASS -> SKIP
> 
>     igt@gem_mocs_settings@mocs-rc6-blt:
>       shard-kbl:          SKIP -> PASS

Sold! Thanks for the review, hopefully this will nip some CI bugs, but
at an error rate of less than 1% it will be some time before we can be
sure.
-Chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-28 11:15 ` [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
@ 2018-04-30 15:49   ` Michel Thierry
  2018-04-30 15:53     ` Chris Wilson
  0 siblings, 1 reply; 17+ messages in thread
From: Michel Thierry @ 2018-04-30 15:49 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 04/28/2018 04:15 AM, Chris Wilson wrote:
> Previously, we just reset the ring register in the context image such
> that we could skip over the broken batch and emit the closing
> breadcrumb. However, on resume the context image and GPU state would be
> reloaded, which may have been left in an inconsistent state by the
> reset. The presumption was that at worst it would just cause another
> reset and skip again until it recovered, however it seems just as likely
> to cause an unrecoverable hang. Instead of risking loading an incomplete
> context image, restore it back to the default state.
> 
> v2: Fix up off-by-one from including the PPHWSP in with the register
> state.
> v3: Use a ring local to compact a few lines.
> v4: Beware setting the ring local before checking for a NULL request.

Didn't you want to set the ring local after this check?
	if (!request || request->fence.error != -EIO)

This is identical to v2.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Reviewed-by: Michel Thierry <michel.thierry@intel.com> #v2
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ce23d5116482..513aee6b3634 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   			      struct i915_request *request)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> -	struct intel_context *ce;
>   	unsigned long flags;
> +	u32 *regs;
>   
>   	GEM_TRACE("%s request global=%x, current=%d\n",
>   		  engine->name, request ? request->global_seqno : 0,
> @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   	 * future request will be after userspace has had the opportunity
>   	 * to recreate its own state.
>   	 */
> -	ce = &request->ctx->engine[engine->id];
> -	execlists_init_reg_state(ce->lrc_reg_state,
> -				 request->ctx, engine, ce->ring);
> +	regs = request->ctx->engine[engine->id].lrc_reg_state;
> +	if (engine->default_state) {
> +		void *defaults;
> +
> +		defaults = i915_gem_object_pin_map(engine->default_state,
> +						   I915_MAP_WB);
> +		if (!IS_ERR(defaults)) {
> +			memcpy(regs, /* skip restoring the vanilla PPHWSP */
> +			       defaults + LRC_STATE_PN * PAGE_SIZE,
> +			       engine->context_size - PAGE_SIZE);
> +			i915_gem_object_unpin_map(engine->default_state);
> +		}
> +	}
> +	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
>   
>   	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
> -	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
> -		i915_ggtt_offset(ce->ring->vma);
> -	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
> +	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
> +	regs[CTX_RING_HEAD + 1] = request->postfix;
>   
>   	request->ring->head = request->postfix;
>   	intel_ring_update_space(request->ring);
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request
  2018-04-30 15:49   ` Michel Thierry
@ 2018-04-30 15:53     ` Chris Wilson
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Wilson @ 2018-04-30 15:53 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx; +Cc: Mika

Quoting Michel Thierry (2018-04-30 16:49:53)
> On 04/28/2018 04:15 AM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > v2: Fix up off-by-one from including the PPHWSP in with the register
> > state.
> > v3: Use a ring local to compact a few lines.
> > v4: Beware setting the ring local before checking for a NULL request.
> 
> Didn't you want to set the ring local after this check?
>         if (!request || request->fence.error != -EIO)

I just removed adding the ring local. Fewer changes...
-Chris
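
The ordering concern in the v4 note comes down to not touching the
request before the NULL check. A minimal sketch of the safe ordering,
using simplified stand-in types (sketch_ring, sketch_request and
sketch_reset() are hypothetical, not the i915 structures):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_ring {
	uint32_t head;
};

struct sketch_request {
	struct sketch_ring *ring;
	uint32_t postfix;
	int fence_error;
};

/*
 * Returns 1 if the ring head was moved past the hanging batch, 0 if
 * there was nothing to do. The ring local is only assigned after the
 * NULL check; initialising it from request->ring at declaration would
 * dereference a NULL request.
 */
static int sketch_reset(struct sketch_request *request)
{
	struct sketch_ring *ring;

	if (!request || request->fence_error != -EIO)
		return 0;

	ring = request->ring;
	ring->head = request->postfix;
	return 1;
}
```

Dropping the local altogether, as in the reply above, sidesteps the
problem in the same way.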

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-04-30 15:53 UTC | newest]

Thread overview: 17+ messages
2018-04-27 19:32 [PATCH] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
2018-04-27 20:12 ` Michel Thierry
2018-04-27 20:20   ` Chris Wilson
2018-04-27 20:24 ` [PATCH v2] " Chris Wilson
2018-04-27 20:27   ` Michel Thierry
2018-04-27 20:35     ` Chris Wilson
2018-04-27 22:30       ` Michel Thierry
2018-04-27 20:29 ` [PATCH v3] " Chris Wilson
2018-04-28  7:56 ` ✗ Fi.CI.BAT: failure for " Patchwork
2018-04-28  9:17 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev3) Patchwork
2018-04-28 10:57 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-04-28 11:15 ` [PATCH v4] drm/i915/lrc: Scrub the GPU state of the guilty hanging request Chris Wilson
2018-04-30 15:49   ` Michel Thierry
2018-04-30 15:53     ` Chris Wilson
2018-04-30 10:00 ` ✓ Fi.CI.BAT: success for drm/i915/lrc: Scrub the GPU state of the guilty hanging request (rev4) Patchwork
2018-04-30 12:38 ` ✓ Fi.CI.IGT: " Patchwork
2018-04-30 13:06   ` Chris Wilson
