[PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()
@ 2017-04-13 13:07 Chris Wilson
  2017-04-13 13:07 ` [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling Chris Wilson
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Chris Wilson @ 2017-04-13 13:07 UTC (permalink / raw)
  To: intel-gfx

The hangcheck runs independently to the main flow of seqno through the
driver. However, we have an odd coupling of the seqno reset that is
unwelcome, and if poked at just the right rate can cause spurious hangs
(e.g. gem_exec_whisper) on an apparently idle engine.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 7681d17ce454..f3cb7e931317 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -265,6 +265,7 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
 	struct drm_i915_private *dev_priv = engine->i915;
 
 	GEM_BUG_ON(!intel_engine_is_idle(engine));
+	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
 
 	/* Our semaphore implementation is strictly monotonic (i.e. we proceed
 	 * so long as the semaphore value in the register/page is greater
@@ -284,9 +285,6 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
 	intel_write_status_page(engine, I915_GEM_HWS_INDEX, seqno);
 	clear_bit(ENGINE_IRQ_BREADCRUMB, &engine->irq_posted);
 
-	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
-	engine->hangcheck.seqno = seqno;
-
 	/* After manually advancing the seqno, fake the interrupt in case
 	 * there are any waiters for that seqno.
 	 */
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling
  2017-04-13 13:07 [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Chris Wilson
@ 2017-04-13 13:07 ` Chris Wilson
  2017-04-19 14:09   ` Mika Kuoppala
  2017-04-13 13:37 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Patchwork
  2017-04-21  8:12 ` [PATCH 1/2] " Mika Kuoppala
  2 siblings, 1 reply; 7+ messages in thread
From: Chris Wilson @ 2017-04-13 13:07 UTC (permalink / raw)
  To: intel-gfx

If we have a long period of idleness, we turn off the hangcheck timer
and stop polling the hardware. Before we restart the hangcheck, we
should clear the previous timestamps to prevent us thinking that the
engine was stalled for a long time, if the seqno were manipulated
carefully (such as the repeating patterns in gem_exec_whisper).

It should have no impact upon normal use.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_hangcheck.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index b0ca0c4c70d9..a74decca5109 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -409,13 +409,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
-		return;
+		goto disarm_hangcheck;
 
 	if (!READ_ONCE(dev_priv->gt.awake))
-		return;
+		goto disarm_hangcheck;
 
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
-		return;
+		goto disarm_hangcheck;
 
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
@@ -446,8 +446,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		hangcheck_declare_hang(dev_priv, hung, stuck);
 
 	/* Reset timer in case GPU hangs without another request being added */
-	if (busy_count)
+	if (busy_count) {
 		i915_queue_hangcheck(dev_priv);
+		return;
+	}
+
+disarm_hangcheck:
+	for_each_engine(engine, dev_priv, id)
+		intel_engine_init_hangcheck(engine);
 }
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()
  2017-04-13 13:07 [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Chris Wilson
  2017-04-13 13:07 ` [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling Chris Wilson
@ 2017-04-13 13:37 ` Patchwork
  2017-04-21  8:12 ` [PATCH 1/2] " Mika Kuoppala
  2 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2017-04-13 13:37 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()
URL   : https://patchwork.freedesktop.org/series/23003/
State : failure

== Summary ==

Series 23003v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/23003/revisions/1/mbox/

Test gem_busy:
        Subgroup basic-hang-default:
                pass       -> FAIL       (fi-hsw-4770r) fdo#100561
                pass       -> FAIL       (fi-bsw-n3050)
                pass       -> FAIL       (fi-byt-n2820)
                pass       -> FAIL       (fi-bxt-j4205)
                pass       -> FAIL       (fi-kbl-7500u)
                pass       -> FAIL       (fi-bdw-5557u)
                pass       -> FAIL       (fi-skl-6260u)
                pass       -> FAIL       (fi-skl-6700hq)
                pass       -> FAIL       (fi-kbl-7560u)
                pass       -> FAIL       (fi-snb-2600)
                pass       -> FAIL       (fi-byt-j1900)
                pass       -> FAIL       (fi-skl-6700k)
                pass       -> FAIL       (fi-snb-2520m)
                pass       -> FAIL       (fi-bdw-gvtdvm)
                pass       -> FAIL       (fi-skl-gvtdvm)
                pass       -> FAIL       (fi-ilk-650)
                pass       -> FAIL       (fi-ivb-3520m)
                pass       -> FAIL       (fi-bxt-t5700)
                pass       -> FAIL       (fi-skl-6770hq)
                pass       -> FAIL       (fi-hsw-4770)
                pass       -> FAIL       (fi-ivb-3770)

fdo#100561 https://bugs.freedesktop.org/show_bug.cgi?id=100561

fi-bdw-5557u     total:278  pass:266  dwarn:0   dfail:0   fail:1   skip:11  time:546s
fi-bdw-gvtdvm    total:278  pass:255  dwarn:8   dfail:0   fail:1   skip:14  time:545s
fi-bsw-n3050     total:278  pass:241  dwarn:0   dfail:0   fail:1   skip:36  time:695s
fi-bxt-j4205     total:278  pass:258  dwarn:0   dfail:0   fail:1   skip:19  time:624s
fi-bxt-t5700     total:278  pass:257  dwarn:0   dfail:0   fail:1   skip:20  time:679s
fi-byt-j1900     total:278  pass:249  dwarn:0   dfail:0   fail:5   skip:24  time:602s
fi-byt-n2820     total:278  pass:245  dwarn:0   dfail:0   fail:5   skip:28  time:594s
fi-hsw-4770      total:278  pass:257  dwarn:0   dfail:0   fail:5   skip:16  time:528s
fi-hsw-4770r     total:278  pass:257  dwarn:0   dfail:0   fail:5   skip:16  time:524s
fi-ilk-650       total:278  pass:223  dwarn:0   dfail:0   fail:5   skip:50  time:539s
fi-ivb-3520m     total:278  pass:253  dwarn:0   dfail:0   fail:7   skip:18  time:605s
fi-ivb-3770      total:278  pass:255  dwarn:0   dfail:0   fail:5   skip:18  time:583s
fi-kbl-7500u     total:278  pass:259  dwarn:0   dfail:0   fail:1   skip:18  time:573s
fi-kbl-7560u     total:278  pass:266  dwarn:1   dfail:0   fail:1   skip:10  time:682s
fi-skl-6260u     total:278  pass:267  dwarn:0   dfail:0   fail:1   skip:10  time:574s
fi-skl-6700hq    total:278  pass:260  dwarn:0   dfail:0   fail:1   skip:17  time:688s
fi-skl-6700k     total:278  pass:255  dwarn:4   dfail:0   fail:1   skip:18  time:577s
fi-skl-6770hq    total:278  pass:267  dwarn:0   dfail:0   fail:1   skip:10  time:618s
fi-skl-gvtdvm    total:278  pass:264  dwarn:0   dfail:0   fail:1   skip:13  time:549s
fi-snb-2520m     total:278  pass:245  dwarn:0   dfail:0   fail:5   skip:28  time:651s
fi-snb-2600      total:278  pass:244  dwarn:0   dfail:0   fail:5   skip:29  time:524s

f1faf571d9530365a34fe33a3efa3fb224661692 drm-tip: 2017y-04m-13d-12h-34m-48s UTC integration manifest
23aee1e drm/i915: Reset hangcheck timeouts upon idling
5fc653d drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4502/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling
  2017-04-13 13:07 ` [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling Chris Wilson
@ 2017-04-19 14:09   ` Mika Kuoppala
  2017-04-19 16:25     ` Chris Wilson
  0 siblings, 1 reply; 7+ messages in thread
From: Mika Kuoppala @ 2017-04-19 14:09 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we have a long period of idleness, we turn off the hangcheck timer
> and stop polling the hardware. Before we restart the hangcheck, we
> should clear the previous timestamps to prevent us thinking that the
> engine was stalled for a long time, if the seqno were manipulated
> carefully (such as the repeating patterns in gem_exec_whisper).
>
> It should have no impact upon normal use.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_hangcheck.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index b0ca0c4c70d9..a74decca5109 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -409,13 +409,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	int busy_count = 0;
>  
>  	if (!i915.enable_hangcheck)
> -		return;
> +		goto disarm_hangcheck;
>  
>  	if (!READ_ONCE(dev_priv->gt.awake))
> -		return;
> +		goto disarm_hangcheck;
>  
>  	if (i915_terminally_wedged(&dev_priv->gpu_error))
> -		return;
> +		goto disarm_hangcheck;
>  
>  	/* As enabling the GPU requires fairly extensive mmio access,
>  	 * periodically arm the mmio checker to see if we are triggering
> @@ -446,8 +446,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  		hangcheck_declare_hang(dev_priv, hung, stuck);
>  
>  	/* Reset timer in case GPU hangs without another request being added */
> -	if (busy_count)
> +	if (busy_count) {
>  		i915_queue_hangcheck(dev_priv);

Now if we don't have a waiter, we always init hangcheck. And thus
we never detect a hang if so. As demonstrated by the
gem_busy@basic-default-hang.

I suggest we decouple the waiters completely from hangcheck:

-               const bool busy = intel_engine_has_waiter(engine);
+               const bool busy = engine->timeline->inflight_seqnos;

-Mika

> +		return;
> +	}
> +
> +disarm_hangcheck:
> +	for_each_engine(engine, dev_priv, id)
> +		intel_engine_init_hangcheck(engine);
>  }
>  
>  void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
> -- 
> 2.11.0
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling
  2017-04-19 14:09   ` Mika Kuoppala
@ 2017-04-19 16:25     ` Chris Wilson
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Wilson @ 2017-04-19 16:25 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 05:09:46PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If we have a long period of idleness, we turn off the hangcheck timer
> > and stop polling the hardware. Before we restart the hangcheck, we
> > should clear the previous timestamps to prevent us thinking that the
> > engine was stalled for a long time, if the seqno were manipulated
> > carefully (such as the repeating patterns in gem_exec_whisper).
> >
> > It should have no impact upon normal use.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_hangcheck.c | 14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index b0ca0c4c70d9..a74decca5109 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -409,13 +409,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  	int busy_count = 0;
> >  
> >  	if (!i915.enable_hangcheck)
> > -		return;
> > +		goto disarm_hangcheck;
> >  
> >  	if (!READ_ONCE(dev_priv->gt.awake))
> > -		return;
> > +		goto disarm_hangcheck;
> >  
> >  	if (i915_terminally_wedged(&dev_priv->gpu_error))
> > -		return;
> > +		goto disarm_hangcheck;
> >  
> >  	/* As enabling the GPU requires fairly extensive mmio access,
> >  	 * periodically arm the mmio checker to see if we are triggering
> > @@ -446,8 +446,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >  		hangcheck_declare_hang(dev_priv, hung, stuck);
> >  
> >  	/* Reset timer in case GPU hangs without another request being added */
> > -	if (busy_count)
> > +	if (busy_count) {
> >  		i915_queue_hangcheck(dev_priv);
> 
> Now if we don't have a waiter, we always init hangcheck. And thus
> we never detect a hang if so. As demonstrated by the
> gem_busy@basic-default-hang.
> 
> I suggest we decouple the waiters completely from hangcheck:
> 
> -               const bool busy = intel_engine_has_waiter(engine);
> +               const bool busy = engine->timeline->inflight_seqnos;

inflight seqnos isn't a good choice either, as that doesn't mean the
engine is active yet. The only issue with this patch was resetting the
hangcheck.seqno nerfed the waiterless hangcheck. Turned out to be a very
bad idea.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()
  2017-04-13 13:07 [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Chris Wilson
  2017-04-13 13:07 ` [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling Chris Wilson
  2017-04-13 13:37 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Patchwork
@ 2017-04-21  8:12 ` Mika Kuoppala
  2017-04-21  8:36   ` Chris Wilson
  2 siblings, 1 reply; 7+ messages in thread
From: Mika Kuoppala @ 2017-04-21  8:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> The hangcheck runs independently to the main flow of seqno through the
> driver. However, we have an odd coupling of the seqno reset that is
> unwelcome, and if poked at just the right rate can cause spurious hangs
> (e.g. gem_exec_whisper) on an apparently idle engine.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/intel_engine_cs.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 7681d17ce454..f3cb7e931317 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -265,6 +265,7 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
>  	struct drm_i915_private *dev_priv = engine->i915;
>  
>  	GEM_BUG_ON(!intel_engine_is_idle(engine));
> +	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
>  
>  	/* Our semaphore implementation is strictly monotonic (i.e. we proceed
>  	 * so long as the semaphore value in the register/page is greater
> @@ -284,9 +285,6 @@ void intel_engine_init_global_seqno(struct intel_engine_cs *engine, u32 seqno)
>  	intel_write_status_page(engine, I915_GEM_HWS_INDEX, seqno);
>  	clear_bit(ENGINE_IRQ_BREADCRUMB, &engine->irq_posted);
>  
> -	GEM_BUG_ON(i915_gem_active_isset(&engine->timeline->last_request));
> -	engine->hangcheck.seqno = seqno;
> -
>  	/* After manually advancing the seqno, fake the interrupt in case
>  	 * there are any waiters for that seqno.
>  	 */
> -- 
> 2.11.0
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno()
  2017-04-21  8:12 ` [PATCH 1/2] " Mika Kuoppala
@ 2017-04-21  8:36   ` Chris Wilson
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Wilson @ 2017-04-21  8:36 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Fri, Apr 21, 2017 at 11:12:21AM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > The hangcheck runs independently to the main flow of seqno through the
> > driver. However, we have an odd coupling of the seqno reset that is
> > unwelcome, and if poked at just the right rate can cause spurious hangs
> > (e.g. gem_exec_whisper) on an apparently idle engine.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Ta, I'll send this by itself to CI to confirm that it is safe :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-04-21  8:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-13 13:07 [PATCH 1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Chris Wilson
2017-04-13 13:07 ` [PATCH 2/2] drm/i915: Reset hangcheck timeouts upon idling Chris Wilson
2017-04-19 14:09   ` Mika Kuoppala
2017-04-19 16:25     ` Chris Wilson
2017-04-13 13:37 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915: Stop touching hangcheck.seqno from intel_engine_init_global_seqno() Patchwork
2017-04-21  8:12 ` [PATCH 1/2] " Mika Kuoppala
2017-04-21  8:36   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.