All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Always run hangcheck while the GPU is busy
@ 2018-01-29 14:41 Chris Wilson
  2018-01-29 15:05 ` ✓ Fi.CI.BAT: success for " Patchwork
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Chris Wilson @ 2018-01-29 14:41 UTC (permalink / raw)
  To: intel-gfx

Previously, we relied on only running the hangcheck while somebody was
waiting on the GPU, in order to minimise the amount of time hangcheck
had to run. (If nobody was watching the GPU, nobody would notice if the
GPU wasn't responding -- eventually somebody would care and so kick
hangcheck into action.) However, this falls apart from around commit
4680816be336 ("drm/i915: Wait first for submission, before waiting for
request completion"), as not all waiters declare themselves to hangcheck
and so we could switch off hangcheck and miss GPU hangs even when
waiting under the struct_mutex.

If we enable hangcheck from the first request submission, and let it run
until the GPU is idle again, we forgo all the complexity involved with
only enabling around waiters. Instead we have to be careful that we do
not declare a GPU hang when idly waiting for the next request to be come
ready.

Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c          |  7 +++----
 drivers/gpu/drm/i915/i915_gem_request.c  |  2 ++
 drivers/gpu/drm/i915/intel_breadcrumbs.c | 11 -----------
 drivers/gpu/drm/i915/intel_hangcheck.c   |  7 +------
 4 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 062b21408698..63308ec016a3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3322,16 +3322,15 @@ i915_gem_retire_work_handler(struct work_struct *work)
 		mutex_unlock(&dev->struct_mutex);
 	}
 
-	/* Keep the retire handler running until we are finally idle.
+	/*
+	 * Keep the retire handler running until we are finally idle.
 	 * We do not need to do this test under locking as in the worst-case
 	 * we queue the retire worker once too often.
 	 */
-	if (READ_ONCE(dev_priv->gt.awake)) {
-		i915_queue_hangcheck(dev_priv);
+	if (READ_ONCE(dev_priv->gt.awake))
 		queue_delayed_work(dev_priv->wq,
 				   &dev_priv->gt.retire_work,
 				   round_jiffies_up_relative(HZ));
-	}
 }
 
 static void shrink_caches(struct drm_i915_private *i915)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 9b02310307fc..6cacd78cc849 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -393,6 +393,8 @@ static void mark_busy(struct drm_i915_private *i915)
 
 	intel_engines_unpark(i915);
 
+	i915_queue_hangcheck(i915);
+
 	queue_delayed_work(i915->wq,
 			   &i915->gt.retire_work,
 			   round_jiffies_up_relative(HZ));
diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
index 86acac010bb8..62b2a20bc24e 100644
--- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
@@ -149,17 +149,6 @@ static void intel_breadcrumbs_fake_irq(struct timer_list *t)
 		return;
 
 	mod_timer(&b->fake_irq, jiffies + 1);
-
-	/* Ensure that even if the GPU hangs, we get woken up.
-	 *
-	 * However, note that if no one is waiting, we never notice
-	 * a gpu hang. Eventually, we will have to wait for a resource
-	 * held by the GPU and so trigger a hangcheck. In the most
-	 * pathological case, this will be upon memory starvation! To
-	 * prevent this, we also queue the hangcheck from the retire
-	 * worker.
-	 */
-	i915_queue_hangcheck(engine->i915);
 }
 
 static void irq_enable(struct intel_engine_cs *engine)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 31f01d64c021..348a4f7ffb67 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -411,7 +411,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 	unsigned int hung = 0, stuck = 0;
-	int busy_count = 0;
 
 	if (!i915_modparams.enable_hangcheck)
 		return;
@@ -429,7 +428,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
 
 	for_each_engine(engine, dev_priv, id) {
-		const bool busy = intel_engine_has_waiter(engine);
 		struct intel_engine_hangcheck hc;
 
 		semaphore_clear_deadlocks(dev_priv);
@@ -443,16 +441,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			if (hc.action != ENGINE_DEAD)
 				stuck |= intel_engine_flag(engine);
 		}
-
-		busy_count += busy;
 	}
 
 	if (hung)
 		hangcheck_declare_hang(dev_priv, hung, stuck);
 
 	/* Reset timer in case GPU hangs without another request being added */
-	if (busy_count)
-		i915_queue_hangcheck(dev_priv);
+	i915_queue_hangcheck(dev_priv);
 }
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
-- 
2.15.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Always run hangcheck while the GPU is busy
  2018-01-29 14:41 [PATCH] drm/i915: Always run hangcheck while the GPU is busy Chris Wilson
@ 2018-01-29 15:05 ` Patchwork
  2018-01-29 23:41 ` [PATCH] " Antonio Argenziano
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Patchwork @ 2018-01-29 15:05 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Always run hangcheck while the GPU is busy
URL   : https://patchwork.freedesktop.org/series/37281/
State : success

== Summary ==

Series 37281v1 drm/i915: Always run hangcheck while the GPU is busy
https://patchwork.freedesktop.org/api/1.0/series/37281/revisions/1/mbox/

Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                incomplete -> PASS       (fi-snb-2520m) fdo#103713

fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713

fi-bdw-5557u     total:288  pass:267  dwarn:0   dfail:0   fail:0   skip:21  time:422s
fi-bdw-gvtdvm    total:288  pass:264  dwarn:0   dfail:0   fail:0   skip:24  time:427s
fi-blb-e6850     total:288  pass:223  dwarn:1   dfail:0   fail:0   skip:64  time:372s
fi-bsw-n3050     total:288  pass:242  dwarn:0   dfail:0   fail:0   skip:46  time:489s
fi-bwr-2160      total:288  pass:183  dwarn:0   dfail:0   fail:0   skip:105 time:283s
fi-bxt-dsi       total:288  pass:258  dwarn:0   dfail:0   fail:0   skip:30  time:486s
fi-bxt-j4205     total:288  pass:259  dwarn:0   dfail:0   fail:0   skip:29  time:486s
fi-byt-j1900     total:288  pass:253  dwarn:0   dfail:0   fail:0   skip:35  time:475s
fi-byt-n2820     total:288  pass:249  dwarn:0   dfail:0   fail:0   skip:39  time:456s
fi-cfl-s2        total:288  pass:262  dwarn:0   dfail:0   fail:0   skip:26  time:565s
fi-elk-e7500     total:224  pass:168  dwarn:10  dfail:0   fail:0   skip:45 
fi-glk-1         total:288  pass:260  dwarn:0   dfail:0   fail:0   skip:28  time:517s
fi-hsw-4770      total:288  pass:261  dwarn:0   dfail:0   fail:0   skip:27  time:394s
fi-hsw-4770r     total:288  pass:261  dwarn:0   dfail:0   fail:0   skip:27  time:398s
fi-ilk-650       total:288  pass:228  dwarn:0   dfail:0   fail:0   skip:60  time:412s
fi-ivb-3520m     total:288  pass:259  dwarn:0   dfail:0   fail:0   skip:29  time:459s
fi-ivb-3770      total:288  pass:255  dwarn:0   dfail:0   fail:0   skip:33  time:412s
fi-kbl-7500u     total:288  pass:263  dwarn:1   dfail:0   fail:0   skip:24  time:458s
fi-kbl-7560u     total:288  pass:269  dwarn:0   dfail:0   fail:0   skip:19  time:502s
fi-kbl-7567u     total:288  pass:268  dwarn:0   dfail:0   fail:0   skip:20  time:455s
fi-kbl-r         total:288  pass:261  dwarn:0   dfail:0   fail:0   skip:27  time:501s
fi-pnv-d510      total:288  pass:222  dwarn:1   dfail:0   fail:0   skip:65  time:582s
fi-skl-6260u     total:288  pass:268  dwarn:0   dfail:0   fail:0   skip:20  time:436s
fi-skl-6600u     total:288  pass:261  dwarn:0   dfail:0   fail:0   skip:27  time:510s
fi-skl-6700hq    total:288  pass:262  dwarn:0   dfail:0   fail:0   skip:26  time:529s
fi-skl-6700k2    total:288  pass:264  dwarn:0   dfail:0   fail:0   skip:24  time:488s
fi-skl-6770hq    total:288  pass:268  dwarn:0   dfail:0   fail:0   skip:20  time:484s
fi-skl-guc       total:288  pass:260  dwarn:0   dfail:0   fail:0   skip:28  time:415s
fi-skl-gvtdvm    total:288  pass:265  dwarn:0   dfail:0   fail:0   skip:23  time:433s
fi-snb-2520m     total:288  pass:248  dwarn:0   dfail:0   fail:0   skip:40  time:520s
fi-snb-2600      total:288  pass:248  dwarn:0   dfail:0   fail:0   skip:40  time:398s
Blacklisted hosts:
fi-cnl-y2        total:288  pass:261  dwarn:0   dfail:0   fail:0   skip:27  time:527s
fi-glk-dsi       total:288  pass:258  dwarn:0   dfail:0   fail:0   skip:30  time:469s
fi-gdg-551 failed to collect. IGT log at Patchwork_7808/fi-gdg-551/igt.log

5f9f3637ad4677971149ec7cde9ed2081a90839b drm-tip: 2018y-01m-29d-13h-02m-54s UTC integration manifest
589aaf248006 drm/i915: Always run hangcheck while the GPU is busy

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7808/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-29 14:41 [PATCH] drm/i915: Always run hangcheck while the GPU is busy Chris Wilson
  2018-01-29 15:05 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2018-01-29 23:41 ` Antonio Argenziano
  2018-01-30 12:18 ` Mika Kuoppala
  2018-01-31  9:41 ` Mika Kuoppala
  3 siblings, 0 replies; 9+ messages in thread
From: Antonio Argenziano @ 2018-01-29 23:41 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 29/01/18 06:41, Chris Wilson wrote:
> Previously, we relied on only running the hangcheck while somebody was
> waiting on the GPU, in order to minimise the amount of time hangcheck
> had to run. (If nobody was watching the GPU, nobody would notice if the
> GPU wasn't responding -- eventually somebody would care and so kick
> hangcheck into action.) However, this falls apart from around commit
> 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> request completion"), as not all waiters declare themselves to hangcheck
> and so we could switch off hangcheck and miss GPU hangs even when
> waiting under the struct_mutex.
> 
> If we enable hangcheck from the first request submission, and let it run
> until the GPU is idle again, we forgo all the complexity involved with
> only enabling around waiters. Instead we have to be careful that we do
> not declare a GPU hang when idly waiting for the next request to be come
> ready.
> 
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"

It also fixes https://bugs.freedesktop.org/show_bug.cgi?id=104840.

Thanks,
Antonio

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c          |  7 +++----
>   drivers/gpu/drm/i915/i915_gem_request.c  |  2 ++
>   drivers/gpu/drm/i915/intel_breadcrumbs.c | 11 -----------
>   drivers/gpu/drm/i915/intel_hangcheck.c   |  7 +------
>   4 files changed, 6 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 062b21408698..63308ec016a3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3322,16 +3322,15 @@ i915_gem_retire_work_handler(struct work_struct *work)
>   		mutex_unlock(&dev->struct_mutex);
>   	}
>   
> -	/* Keep the retire handler running until we are finally idle.
> +	/*
> +	 * Keep the retire handler running until we are finally idle.
>   	 * We do not need to do this test under locking as in the worst-case
>   	 * we queue the retire worker once too often.
>   	 */
> -	if (READ_ONCE(dev_priv->gt.awake)) {
> -		i915_queue_hangcheck(dev_priv);
> +	if (READ_ONCE(dev_priv->gt.awake))
>   		queue_delayed_work(dev_priv->wq,
>   				   &dev_priv->gt.retire_work,
>   				   round_jiffies_up_relative(HZ));
> -	}
>   }
>   
>   static void shrink_caches(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 9b02310307fc..6cacd78cc849 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -393,6 +393,8 @@ static void mark_busy(struct drm_i915_private *i915)
>   
>   	intel_engines_unpark(i915);
>   
> +	i915_queue_hangcheck(i915);
> +
>   	queue_delayed_work(i915->wq,
>   			   &i915->gt.retire_work,
>   			   round_jiffies_up_relative(HZ));
> diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> index 86acac010bb8..62b2a20bc24e 100644
> --- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
> +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> @@ -149,17 +149,6 @@ static void intel_breadcrumbs_fake_irq(struct timer_list *t)
>   		return;
>   
>   	mod_timer(&b->fake_irq, jiffies + 1);
> -
> -	/* Ensure that even if the GPU hangs, we get woken up.
> -	 *
> -	 * However, note that if no one is waiting, we never notice
> -	 * a gpu hang. Eventually, we will have to wait for a resource
> -	 * held by the GPU and so trigger a hangcheck. In the most
> -	 * pathological case, this will be upon memory starvation! To
> -	 * prevent this, we also queue the hangcheck from the retire
> -	 * worker.
> -	 */
> -	i915_queue_hangcheck(engine->i915);
>   }
>   
>   static void irq_enable(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 31f01d64c021..348a4f7ffb67 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -411,7 +411,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	struct intel_engine_cs *engine;
>   	enum intel_engine_id id;
>   	unsigned int hung = 0, stuck = 0;
> -	int busy_count = 0;
>   
>   	if (!i915_modparams.enable_hangcheck)
>   		return;
> @@ -429,7 +428,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
>   
>   	for_each_engine(engine, dev_priv, id) {
> -		const bool busy = intel_engine_has_waiter(engine);
>   		struct intel_engine_hangcheck hc;
>   
>   		semaphore_clear_deadlocks(dev_priv);
> @@ -443,16 +441,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   			if (hc.action != ENGINE_DEAD)
>   				stuck |= intel_engine_flag(engine);
>   		}
> -
> -		busy_count += busy;
>   	}
>   
>   	if (hung)
>   		hangcheck_declare_hang(dev_priv, hung, stuck);
>   
>   	/* Reset timer in case GPU hangs without another request being added */
> -	if (busy_count)
> -		i915_queue_hangcheck(dev_priv);
> +	i915_queue_hangcheck(dev_priv);
>   }
>   
>   void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-29 14:41 [PATCH] drm/i915: Always run hangcheck while the GPU is busy Chris Wilson
  2018-01-29 15:05 ` ✓ Fi.CI.BAT: success for " Patchwork
  2018-01-29 23:41 ` [PATCH] " Antonio Argenziano
@ 2018-01-30 12:18 ` Mika Kuoppala
  2018-01-30 12:24   ` Chris Wilson
  2018-01-31  9:41 ` Mika Kuoppala
  3 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2018-01-30 12:18 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Previously, we relied on only running the hangcheck while somebody was
> waiting on the GPU, in order to minimise the amount of time hangcheck
> had to run. (If nobody was watching the GPU, nobody would notice if the
> GPU wasn't responding -- eventually somebody would care and so kick
> hangcheck into action.) However, this falls apart from around commit
> 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> request completion"), as not all waiters declare themselves to hangcheck
> and so we could switch off hangcheck and miss GPU hangs even when
> waiting under the struct_mutex.
>
> If we enable hangcheck from the first request submission, and let it run
> until the GPU is idle again, we forgo all the complexity involved with
> only enabling around waiters. Instead we have to be careful that we do
> not declare a GPU hang when idly waiting for the next request to be come
> ready.

For the complexity part I agree that this is simple and elegant. But
I think I have not understood it fully as I don't connect the part where
we need to be careful in idly waiting for next request.
Could you elaborate and point it the relevant portion in the patch for it?

-Mika

>
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c          |  7 +++----
>  drivers/gpu/drm/i915/i915_gem_request.c  |  2 ++
>  drivers/gpu/drm/i915/intel_breadcrumbs.c | 11 -----------
>  drivers/gpu/drm/i915/intel_hangcheck.c   |  7 +------
>  4 files changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 062b21408698..63308ec016a3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3322,16 +3322,15 @@ i915_gem_retire_work_handler(struct work_struct *work)
>  		mutex_unlock(&dev->struct_mutex);
>  	}
>  
> -	/* Keep the retire handler running until we are finally idle.
> +	/*
> +	 * Keep the retire handler running until we are finally idle.
>  	 * We do not need to do this test under locking as in the worst-case
>  	 * we queue the retire worker once too often.
>  	 */
> -	if (READ_ONCE(dev_priv->gt.awake)) {
> -		i915_queue_hangcheck(dev_priv);
> +	if (READ_ONCE(dev_priv->gt.awake))
>  		queue_delayed_work(dev_priv->wq,
>  				   &dev_priv->gt.retire_work,
>  				   round_jiffies_up_relative(HZ));
> -	}
>  }
>  
>  static void shrink_caches(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 9b02310307fc..6cacd78cc849 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -393,6 +393,8 @@ static void mark_busy(struct drm_i915_private *i915)
>  
>  	intel_engines_unpark(i915);
>  
> +	i915_queue_hangcheck(i915);
> +
>  	queue_delayed_work(i915->wq,
>  			   &i915->gt.retire_work,
>  			   round_jiffies_up_relative(HZ));
> diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> index 86acac010bb8..62b2a20bc24e 100644
> --- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
> +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> @@ -149,17 +149,6 @@ static void intel_breadcrumbs_fake_irq(struct timer_list *t)
>  		return;
>  
>  	mod_timer(&b->fake_irq, jiffies + 1);
> -
> -	/* Ensure that even if the GPU hangs, we get woken up.
> -	 *
> -	 * However, note that if no one is waiting, we never notice
> -	 * a gpu hang. Eventually, we will have to wait for a resource
> -	 * held by the GPU and so trigger a hangcheck. In the most
> -	 * pathological case, this will be upon memory starvation! To
> -	 * prevent this, we also queue the hangcheck from the retire
> -	 * worker.
> -	 */
> -	i915_queue_hangcheck(engine->i915);
>  }
>  
>  static void irq_enable(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 31f01d64c021..348a4f7ffb67 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -411,7 +411,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  	unsigned int hung = 0, stuck = 0;
> -	int busy_count = 0;
>  
>  	if (!i915_modparams.enable_hangcheck)
>  		return;
> @@ -429,7 +428,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
>  
>  	for_each_engine(engine, dev_priv, id) {
> -		const bool busy = intel_engine_has_waiter(engine);
>  		struct intel_engine_hangcheck hc;
>  
>  		semaphore_clear_deadlocks(dev_priv);
> @@ -443,16 +441,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  			if (hc.action != ENGINE_DEAD)
>  				stuck |= intel_engine_flag(engine);
>  		}
> -
> -		busy_count += busy;
>  	}
>  
>  	if (hung)
>  		hangcheck_declare_hang(dev_priv, hung, stuck);
>  
>  	/* Reset timer in case GPU hangs without another request being added */
> -	if (busy_count)
> -		i915_queue_hangcheck(dev_priv);
> +	i915_queue_hangcheck(dev_priv);
>  }
>  
>  void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
> -- 
> 2.15.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-30 12:18 ` Mika Kuoppala
@ 2018-01-30 12:24   ` Chris Wilson
  2018-01-30 13:00     ` Mika Kuoppala
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2018-01-30 12:24 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-01-30 12:18:17)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Previously, we relied on only running the hangcheck while somebody was
> > waiting on the GPU, in order to minimise the amount of time hangcheck
> > had to run. (If nobody was watching the GPU, nobody would notice if the
> > GPU wasn't responding -- eventually somebody would care and so kick
> > hangcheck into action.) However, this falls apart from around commit
> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> > request completion"), as not all waiters declare themselves to hangcheck
> > and so we could switch off hangcheck and miss GPU hangs even when
> > waiting under the struct_mutex.
> >
> > If we enable hangcheck from the first request submission, and let it run
> > until the GPU is idle again, we forgo all the complexity involved with
> > only enabling around waiters. Instead we have to be careful that we do
> > not declare a GPU hang when idly waiting for the next request to be come
> > ready.
> 
> For the complexity part I agree that this is simple and elegant. But
> I think I have not understood it fully as I don't connect the part where
> we need to be careful in idly waiting for next request.
> Could you elaborate and point it the relevant portion in the patch for it?

It's not in this patch, it's just relating to the experiences we've had
previously in compensating for an engine with requests scheduled waiting
for a signal, making sure we treated those engines as idle rather than
stuck.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-30 12:24   ` Chris Wilson
@ 2018-01-30 13:00     ` Mika Kuoppala
  2018-01-30 15:52       ` Chris Wilson
  0 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2018-01-30 13:00 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-01-30 12:18:17)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > Previously, we relied on only running the hangcheck while somebody was
>> > waiting on the GPU, in order to minimise the amount of time hangcheck
>> > had to run. (If nobody was watching the GPU, nobody would notice if the
>> > GPU wasn't responding -- eventually somebody would care and so kick
>> > hangcheck into action.) However, this falls apart from around commit
>> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
>> > request completion"), as not all waiters declare themselves to hangcheck
>> > and so we could switch off hangcheck and miss GPU hangs even when
>> > waiting under the struct_mutex.
>> >
>> > If we enable hangcheck from the first request submission, and let it run
>> > until the GPU is idle again, we forgo all the complexity involved with
>> > only enabling around waiters. Instead we have to be careful that we do
>> > not declare a GPU hang when idly waiting for the next request to be come
>> > ready.
>> 
>> For the complexity part I agree that this is simple and elegant. But
>> I think I have not understood it fully as I don't connect the part where
>> we need to be careful in idly waiting for next request.
>> Could you elaborate and point it the relevant portion in the patch for it?
>
> It's not in this patch, it's just relating to the experiences we've had
> previously in compensating for an engine with requests scheduled waiting
> for a signal, making sure we treated those engines as idle rather than
> stuck.

Ok. Perhaps the last sentence can be omitted then.

I tried to look if we somehow could miss an idle engine check
and declare a false hang if we somehow end up doing a check on
a just idled hardware.

Could not find a clear way that would happen but as
the gt.awake is now a master, should it be first thing we
check in intel_engine_is_idle() to limit how far
we look into the rabbit hole?

-Mika

> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-30 13:00     ` Mika Kuoppala
@ 2018-01-30 15:52       ` Chris Wilson
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Wilson @ 2018-01-30 15:52 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-01-30 13:00:50)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2018-01-30 12:18:17)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > Previously, we relied on only running the hangcheck while somebody was
> >> > waiting on the GPU, in order to minimise the amount of time hangcheck
> >> > had to run. (If nobody was watching the GPU, nobody would notice if the
> >> > GPU wasn't responding -- eventually somebody would care and so kick
> >> > hangcheck into action.) However, this falls apart from around commit
> >> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> >> > request completion"), as not all waiters declare themselves to hangcheck
> >> > and so we could switch off hangcheck and miss GPU hangs even when
> >> > waiting under the struct_mutex.
> >> >
> >> > If we enable hangcheck from the first request submission, and let it run
> >> > until the GPU is idle again, we forgo all the complexity involved with
> >> > only enabling around waiters. Instead we have to be careful that we do
> >> > not declare a GPU hang when idly waiting for the next request to be come
> >> > ready.
> >> 
> >> For the complexity part I agree that this is simple and elegant. But
> >> I think I have not understood it fully as I don't connect the part where
> >> we need to be careful in idly waiting for next request.
> >> Could you elaborate and point it the relevant portion in the patch for it?
> >
> > It's not in this patch, it's just relating to the experiences we've had
> > previously in compensating for an engine with requests scheduled waiting
> > for a signal, making sure we treated those engines as idle rather than
> > stuck.
> 
> Ok. Perhaps the last sentence can be omitted then.

Just felt it was an important reminder that as we don't switch off as
often, hangcheck needs to be wary of frequently finding itself in
situations where the driver is busy but the engine is idle. We've fallen
foul of that in the past, a hint to [future] reviewers that we need to
double check all will be well.
 
> I tried to look if we somehow could miss an idle engine check
> and declare a false hang if we somehow end up doing a check on
> a just idled hardware.
> 
> Could not find a clear way that would happen but as
> the gt.awake is now a master, should it be first thing we
> check in intel_engine_is_idle() to limit how far
> we look into the rabbit hole?

The other argument being that the user parameter trumps gt.awake.
I think the order is reasonably clear atm. What I'm not completely sold
on is the behaviour around wedging -- the requeue should be from unwedge
in that case and not inside i915_reset(). But I didn't look at all the
implications of moving that i915_queue_hangcheck() and so left it alone
for now.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-29 14:41 [PATCH] drm/i915: Always run hangcheck while the GPU is busy Chris Wilson
                   ` (2 preceding siblings ...)
  2018-01-30 12:18 ` Mika Kuoppala
@ 2018-01-31  9:41 ` Mika Kuoppala
  2018-01-31 10:09   ` Chris Wilson
  3 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2018-01-31  9:41 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Previously, we relied on only running the hangcheck while somebody was
> waiting on the GPU, in order to minimise the amount of time hangcheck
> had to run. (If nobody was watching the GPU, nobody would notice if the
> GPU wasn't responding -- eventually somebody would care and so kick
> hangcheck into action.) However, this falls apart from around commit
> 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> request completion"), as not all waiters declare themselves to hangcheck
> and so we could switch off hangcheck and miss GPU hangs even when
> waiting under the struct_mutex.
>
> If we enable hangcheck from the first request submission, and let it run
> until the GPU is idle again, we forgo all the complexity involved with
> only enabling around waiters. Instead we have to be careful that we do
> not declare a GPU hang when idly waiting for the next request to be come
> ready.
>
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/i915_gem.c          |  7 +++----
>  drivers/gpu/drm/i915/i915_gem_request.c  |  2 ++
>  drivers/gpu/drm/i915/intel_breadcrumbs.c | 11 -----------
>  drivers/gpu/drm/i915/intel_hangcheck.c   |  7 +------
>  4 files changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 062b21408698..63308ec016a3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3322,16 +3322,15 @@ i915_gem_retire_work_handler(struct work_struct *work)
>  		mutex_unlock(&dev->struct_mutex);
>  	}
>  
> -	/* Keep the retire handler running until we are finally idle.
> +	/*
> +	 * Keep the retire handler running until we are finally idle.
>  	 * We do not need to do this test under locking as in the worst-case
>  	 * we queue the retire worker once too often.
>  	 */
> -	if (READ_ONCE(dev_priv->gt.awake)) {
> -		i915_queue_hangcheck(dev_priv);
> +	if (READ_ONCE(dev_priv->gt.awake))
>  		queue_delayed_work(dev_priv->wq,
>  				   &dev_priv->gt.retire_work,
>  				   round_jiffies_up_relative(HZ));
> -	}
>  }
>  
>  static void shrink_caches(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 9b02310307fc..6cacd78cc849 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -393,6 +393,8 @@ static void mark_busy(struct drm_i915_private *i915)
>  
>  	intel_engines_unpark(i915);
>  
> +	i915_queue_hangcheck(i915);
> +
>  	queue_delayed_work(i915->wq,
>  			   &i915->gt.retire_work,
>  			   round_jiffies_up_relative(HZ));
> diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> index 86acac010bb8..62b2a20bc24e 100644
> --- a/drivers/gpu/drm/i915/intel_breadcrumbs.c
> +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c
> @@ -149,17 +149,6 @@ static void intel_breadcrumbs_fake_irq(struct timer_list *t)
>  		return;
>  
>  	mod_timer(&b->fake_irq, jiffies + 1);
> -
> -	/* Ensure that even if the GPU hangs, we get woken up.
> -	 *
> -	 * However, note that if no one is waiting, we never notice
> -	 * a gpu hang. Eventually, we will have to wait for a resource
> -	 * held by the GPU and so trigger a hangcheck. In the most
> -	 * pathological case, this will be upon memory starvation! To
> -	 * prevent this, we also queue the hangcheck from the retire
> -	 * worker.
> -	 */
> -	i915_queue_hangcheck(engine->i915);
>  }
>  
>  static void irq_enable(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 31f01d64c021..348a4f7ffb67 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -411,7 +411,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  	unsigned int hung = 0, stuck = 0;
> -	int busy_count = 0;
>  
>  	if (!i915_modparams.enable_hangcheck)
>  		return;
> @@ -429,7 +428,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
>  
>  	for_each_engine(engine, dev_priv, id) {
> -		const bool busy = intel_engine_has_waiter(engine);
>  		struct intel_engine_hangcheck hc;
>  
>  		semaphore_clear_deadlocks(dev_priv);
> @@ -443,16 +441,13 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  			if (hc.action != ENGINE_DEAD)
>  				stuck |= intel_engine_flag(engine);
>  		}
> -
> -		busy_count += busy;
>  	}
>  
>  	if (hung)
>  		hangcheck_declare_hang(dev_priv, hung, stuck);
>  
>  	/* Reset timer in case GPU hangs without another request being added */
> -	if (busy_count)
> -		i915_queue_hangcheck(dev_priv);
> +	i915_queue_hangcheck(dev_priv);
>  }
>  
>  void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
> -- 
> 2.15.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/i915: Always run hangcheck while the GPU is busy
  2018-01-31  9:41 ` Mika Kuoppala
@ 2018-01-31 10:09   ` Chris Wilson
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Wilson @ 2018-01-31 10:09 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-01-31 09:41:35)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Previously, we relied on only running the hangcheck while somebody was
> > waiting on the GPU, in order to minimise the amount of time hangcheck
> > had to run. (If nobody was watching the GPU, nobody would notice if the
> > GPU wasn't responding -- eventually somebody would care and so kick
> > hangcheck into action.) However, this falls apart from around commit
> > 4680816be336 ("drm/i915: Wait first for submission, before waiting for
> > request completion"), as not all waiters declare themselves to hangcheck
> > and so we could switch off hangcheck and miss GPU hangs even when
> > waiting under the struct_mutex.
> >
> > If we enable hangcheck from the first request submission, and let it run
> > until the GPU is idle again, we forgo all the complexity involved with
> > only enabling around waiters. Instead we have to be careful that we do
> > not declare a GPU hang when idly waiting for the next request to be come
> > ready.
> >
> > Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Rewrote the last paragraph to try and make it clear what I was hinting
at, and so it doesn't sound like a description of what this patch is
doing but the background mechanics that this patch relies upon.

Thanks for the review and discussion,
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-01-31 10:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-29 14:41 [PATCH] drm/i915: Always run hangcheck while the GPU is busy Chris Wilson
2018-01-29 15:05 ` ✓ Fi.CI.BAT: success for " Patchwork
2018-01-29 23:41 ` [PATCH] " Antonio Argenziano
2018-01-30 12:18 ` Mika Kuoppala
2018-01-30 12:24   ` Chris Wilson
2018-01-30 13:00     ` Mika Kuoppala
2018-01-30 15:52       ` Chris Wilson
2018-01-31  9:41 ` Mika Kuoppala
2018-01-31 10:09   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.