* [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
@ 2016-11-03 16:19 Imre Deak
  2016-11-03 16:19 ` [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume " Imre Deak
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Imre Deak @ 2016-11-03 16:19 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

We assume that the GPU is idle once we receive the seqno via the last
request's user interrupt. In execlist mode, though, the corresponding
context completed interrupt can be delayed, and until this latter
interrupt arrives we consider the request to be pending on the ELSP
submit port. This can cause a problem during system suspend, where
this last request will be seen by the resume code as still pending.
Such pending requests are normally replayed after a GPU reset, but
during resume we reset both the SW and HW tracking of the ring
head/tail pointers, so replaying the pending request with its stale
tail pointer will leave the ring in an inconsistent state. A
subsequent request submission can then lead to the GPU executing from
an uninitialized area of the ring behind the above stale tail pointer.

Fix this by making sure any pending request on the ELSP port is
completed before suspending. I used a polling wait, since the
completion time I measured was <1ms and since normally we only need to
wait during system suspend. GPU idling during runtime suspend is
scheduled with a delay (currently 50-100ms) after the retirement of
the last request, at which point the context completed interrupt must
have arrived already.

The chance of this bug was increased by

commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
Author: Imre Deak <imre.deak@intel.com>
Date:   Wed Oct 12 17:46:37 2016 +0300

    drm/i915/hsw: Fix GPU hang during resume from S3-devices state

but it could happen even without the explicit GPU reset, since we
disable interrupts afterwards during the suspend sequence.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c  |  3 +++
 drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h |  1 +
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1f995ce..5ff02b5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
 	if (dev_priv->gt.active_requests)
 		goto out_unlock;
 
+	if (i915.enable_execlists)
+		intel_lr_wait_engines_idle(dev_priv);
+
 	for_each_engine(engine, dev_priv, id)
 		i915_gem_batch_pool_fini(&engine->batch_pool);
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fa3012c..ee4aaf1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -522,6 +522,18 @@ static bool execlists_elsp_idle(struct intel_engine_cs *engine)
 	return !engine->execlist_port[0].request;
 }
 
+void intel_lr_wait_engines_idle(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, dev_priv, id) {
+		if (wait_for(execlists_elsp_idle(engine), 10))
+			DRM_ERROR("Timeout waiting for engine %s to idle\n",
+				  engine->name);
+	}
+}
+
 static bool execlists_elsp_ready(struct intel_engine_cs *engine)
 {
 	int port;
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 4fed816..bf3775e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -87,6 +87,7 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
 
 struct drm_i915_private;
 
+void intel_lr_wait_engines_idle(struct drm_i915_private *dev_priv);
 void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 uint64_t intel_lr_context_descriptor(struct i915_gem_context *ctx,
 				     struct intel_engine_cs *engine);
-- 
2.5.0


* [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume in LR mode
  2016-11-03 16:19 [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode Imre Deak
@ 2016-11-03 16:19 ` Imre Deak
  2016-11-03 19:08   ` Chris Wilson
  2016-11-03 16:45 ` ✗ Fi.CI.BAT: warning for series starting with [1/2] drm/i915: Make sure engines are idle during GPU idling " Patchwork
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Imre Deak @ 2016-11-03 16:19 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

During resume we will reset the SW/HW tracking of each ring's head/tail
pointers and so are not prepared to replay any pending requests (as
opposed to GPU reset time). Add an assert for this.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ee4aaf1..ae46345 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2158,6 +2158,8 @@ void intel_lr_context_resume(struct drm_i915_private *dev_priv)
 			if (WARN_ON(IS_ERR(reg)))
 				continue;
 
+			WARN_ON(!execlists_elsp_idle(engine));
+
 			reg += LRC_STATE_PN * PAGE_SIZE / sizeof(*reg);
 			reg[CTX_RING_HEAD+1] = 0;
 			reg[CTX_RING_TAIL+1] = 0;
-- 
2.5.0


* ✗ Fi.CI.BAT: warning for series starting with [1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 16:19 [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode Imre Deak
  2016-11-03 16:19 ` [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume " Imre Deak
@ 2016-11-03 16:45 ` Patchwork
  2016-11-03 16:49 ` [PATCH 1/2] " Tvrtko Ursulin
  2016-11-03 18:59 ` Chris Wilson
  3 siblings, 0 replies; 12+ messages in thread
From: Patchwork @ 2016-11-03 16:45 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
URL   : https://patchwork.freedesktop.org/series/14789/
State : warning

== Summary ==

Series 14789v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/14789/revisions/1/mbox/

Test drv_module_reload_basic:
                pass       -> DMESG-WARN (fi-skl-6770hq)

fi-bdw-5557u     total:241  pass:226  dwarn:0   dfail:0   fail:0   skip:15 
fi-bsw-n3050     total:241  pass:201  dwarn:0   dfail:0   fail:0   skip:40 
fi-bxt-t5700     total:241  pass:213  dwarn:0   dfail:0   fail:0   skip:28 
fi-byt-j1900     total:241  pass:213  dwarn:0   dfail:0   fail:0   skip:28 
fi-byt-n2820     total:241  pass:209  dwarn:0   dfail:0   fail:0   skip:32 
fi-hsw-4770      total:241  pass:221  dwarn:0   dfail:0   fail:0   skip:20 
fi-hsw-4770r     total:241  pass:221  dwarn:0   dfail:0   fail:0   skip:20 
fi-ilk-650       total:241  pass:188  dwarn:0   dfail:0   fail:0   skip:53 
fi-ivb-3520m     total:241  pass:219  dwarn:0   dfail:0   fail:0   skip:22 
fi-ivb-3770      total:241  pass:219  dwarn:0   dfail:0   fail:0   skip:22 
fi-kbl-7200u     total:241  pass:219  dwarn:0   dfail:0   fail:0   skip:22 
fi-skl-6260u     total:241  pass:227  dwarn:0   dfail:0   fail:0   skip:14 
fi-skl-6700hq    total:241  pass:220  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6700k     total:241  pass:219  dwarn:1   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:241  pass:226  dwarn:1   dfail:0   fail:0   skip:14 
fi-snb-2520m     total:241  pass:209  dwarn:0   dfail:0   fail:0   skip:32 
fi-snb-2600      total:241  pass:208  dwarn:0   dfail:0   fail:0   skip:33 

8a1c897ee1d0d93b8e70991becb6f9f8488dba1b drm-intel-nightly: 2016y-11m-03d-13h-53m-22s UTC integration manifest
48b2b9f drm/i915: Add assert for no pending GPU requests during resume in LR mode
f38c913 drm/i915: Make sure engines are idle during GPU idling in LR mode

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_2899/

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 16:19 [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode Imre Deak
  2016-11-03 16:19 ` [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume " Imre Deak
  2016-11-03 16:45 ` ✗ Fi.CI.BAT: warning for series starting with [1/2] drm/i915: Make sure engines are idle during GPU idling " Patchwork
@ 2016-11-03 16:49 ` Tvrtko Ursulin
  2016-11-03 18:59 ` Chris Wilson
  3 siblings, 0 replies; 12+ messages in thread
From: Tvrtko Ursulin @ 2016-11-03 16:49 UTC (permalink / raw)
  To: Imre Deak, intel-gfx; +Cc: Mika Kuoppala


On 03/11/2016 16:19, Imre Deak wrote:
> We assume that the GPU is idle once we receive the seqno via the last
> request's user interrupt. In execlist mode, though, the corresponding
> context completed interrupt can be delayed, and until this latter
> interrupt arrives we consider the request to be pending on the ELSP
> submit port. This can cause a problem during system suspend, where
> this last request will be seen by the resume code as still pending.
> Such pending requests are normally replayed after a GPU reset, but
> during resume we reset both the SW and HW tracking of the ring
> head/tail pointers, so replaying the pending request with its stale
> tail pointer will leave the ring in an inconsistent state. A
> subsequent request submission can then lead to the GPU executing from
> an uninitialized area of the ring behind the above stale tail pointer.
>
> Fix this by making sure any pending request on the ELSP port is
> completed before suspending. I used a polling wait, since the
> completion time I measured was <1ms and since normally we only need to
> wait during system suspend. GPU idling during runtime suspend is
> scheduled with a delay (currently 50-100ms) after the retirement of
> the last request, at which point the context completed interrupt must
> have arrived already.
>
> The chance of this bug was increased by
>
> commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> Author: Imre Deak <imre.deak@intel.com>
> Date:   Wed Oct 12 17:46:37 2016 +0300
>
>     drm/i915/hsw: Fix GPU hang during resume from S3-devices state
>
> but it could happen even without the explicit GPU reset, since we
> disable interrupts afterwards during the suspend sequence.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
>  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.h |  1 +
>  3 files changed, 16 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1f995ce..5ff02b5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  	if (dev_priv->gt.active_requests)
>  		goto out_unlock;
>
> +	if (i915.enable_execlists)
> +		intel_lr_wait_engines_idle(dev_priv);
> +
>  	for_each_engine(engine, dev_priv, id)
>  		i915_gem_batch_pool_fini(&engine->batch_pool);
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index fa3012c..ee4aaf1 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -522,6 +522,18 @@ static bool execlists_elsp_idle(struct intel_engine_cs *engine)
>  	return !engine->execlist_port[0].request;
>  }
>
> +void intel_lr_wait_engines_idle(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		if (wait_for(execlists_elsp_idle(engine), 10))
> +			DRM_ERROR("Timeout waiting for engine %s to idle\n",
> +				  engine->name);

Just noticed engine names are currently like "render ring", etc., so
you can drop the 'engine' from the message.

> +	}
> +}
> +
>  static bool execlists_elsp_ready(struct intel_engine_cs *engine)
>  {
>  	int port;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 4fed816..bf3775e 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -87,6 +87,7 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
>
>  struct drm_i915_private;
>
> +void intel_lr_wait_engines_idle(struct drm_i915_private *dev_priv);
>  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>  uint64_t intel_lr_context_descriptor(struct i915_gem_context *ctx,
>  				     struct intel_engine_cs *engine);
>

Regards,

Tvrtko

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 16:19 [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode Imre Deak
                   ` (2 preceding siblings ...)
  2016-11-03 16:49 ` [PATCH 1/2] " Tvrtko Ursulin
@ 2016-11-03 18:59 ` Chris Wilson
  2016-11-03 20:57   ` Imre Deak
  3 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2016-11-03 18:59 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx, Mika Kuoppala

On Thu, Nov 03, 2016 at 06:19:37PM +0200, Imre Deak wrote:
> We assume that the GPU is idle once we receive the seqno via the last
> request's user interrupt. In execlist mode, though, the corresponding
> context completed interrupt can be delayed, and until this latter
> interrupt arrives we consider the request to be pending on the ELSP
> submit port. This can cause a problem during system suspend, where
> this last request will be seen by the resume code as still pending.
> Such pending requests are normally replayed after a GPU reset, but
> during resume we reset both the SW and HW tracking of the ring
> head/tail pointers, so replaying the pending request with its stale
> tail pointer will leave the ring in an inconsistent state. A
> subsequent request submission can then lead to the GPU executing from
> an uninitialized area of the ring behind the above stale tail pointer.
>
> Fix this by making sure any pending request on the ELSP port is
> completed before suspending. I used a polling wait, since the
> completion time I measured was <1ms and since normally we only need to
> wait during system suspend. GPU idling during runtime suspend is
> scheduled with a delay (currently 50-100ms) after the retirement of
> the last request, at which point the context completed interrupt must
> have arrived already.
> 
> The chance of this bug was increased by
> 
> commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> Author: Imre Deak <imre.deak@intel.com>
> Date:   Wed Oct 12 17:46:37 2016 +0300
> 
>     drm/i915/hsw: Fix GPU hang during resume from S3-devices state
> 
> but it could happen even without the explicit GPU reset, since we
> disable interrupts afterwards during the suspend sequence.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
>  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.h |  1 +
>  3 files changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1f995ce..5ff02b5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  	if (dev_priv->gt.active_requests)
>  		goto out_unlock;
>  
> +	if (i915.enable_execlists)
> +		intel_lr_wait_engines_idle(dev_priv);

Idle work handler... So runtime suspend. Anyway this is not an
ideal place for a stall under struct_mutex (even if 16x10us, it's the
principle!).

Move this to before the first READ_ONCE(dev_priv->gt.active_requests);
so we stall before taking the lock, and skip if any new requests arrive
whilst waiting.

(Also i915.enable_execlists is forbidden. But meh)

static struct drm_i915_gem_request *
execlists_active_port(struct intel_engine_cs *engine)
{
	struct drm_i915_gem_request *request;

	request = READ_ONCE(engine->execlist_port[1].request);
	if (request)
		return request;

	return READ_ONCE(engine->execlist_port[0].request);
}

/* Wait for execlists to settle, but bail if any new requests come in */
for_each_engine(engine, dev_priv, id) {
	struct drm_i915_gem_request *request;

	request = execlists_active_port(engine);
	if (!request)
		continue;

	if (wait_for(execlists_active_port(engine) != request, 10))
		DRM_ERROR("Timeout waiting for %s to idle\n", engine->name);
}
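
For illustration, here is roughly how that could sit in the idle work
handler before we take struct_mutex (just a sketch -- it uses the
execlists_active_port() helper above, and the locked re-check is only
indicated by a comment):

static void i915_gem_idle_work_handler(struct work_struct *work)
{
	struct drm_i915_private *dev_priv =
		container_of(work, typeof(*dev_priv), gt.idle_work.work);
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	/* Unlocked wait for the ELSP ports to settle before we take
	 * struct_mutex; bail if new requests arrive in the meantime.
	 */
	for_each_engine(engine, dev_priv, id) {
		struct drm_i915_gem_request *request;

		if (READ_ONCE(dev_priv->gt.active_requests))
			return;

		request = execlists_active_port(engine);
		if (!request)
			continue;

		if (wait_for(execlists_active_port(engine) != request, 10))
			DRM_ERROR("Timeout waiting for %s to idle\n",
				  engine->name);
	}

	mutex_lock(&dev_priv->drm.struct_mutex);
	/* ... re-check gt.active_requests, re-check ELSP idle and park
	 * the GPU as before ...
	 */
	mutex_unlock(&dev_priv->drm.struct_mutex);
}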

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume in LR mode
  2016-11-03 16:19 ` [PATCH 2/2] drm/i915: Add assert for no pending GPU requests during resume " Imre Deak
@ 2016-11-03 19:08   ` Chris Wilson
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2016-11-03 19:08 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx, Mika Kuoppala

On Thu, Nov 03, 2016 at 06:19:38PM +0200, Imre Deak wrote:
> During resume we will reset the SW/HW tracking of each ring's head/tail
> pointers and so are not prepared to replay any pending requests (as
> opposed to GPU reset time). Add an assert for this.

A bit belated. You could just put WARN_ON(dev_priv->gt.awake) in
i915_gem_resume() to mirror the assertions that we are idle on suspend.
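
Something like this, say (a sketch only -- where exactly the assert
lands in i915_gem_resume() is left open):

void i915_gem_resume(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = to_i915(dev);

	/* We must have gone to sleep idle, so nothing should be awake
	 * or pending before we start restoring the ring state.
	 */
	WARN_ON(dev_priv->gt.awake);

	/* ... existing resume work ... */
}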
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Signed-off-by: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ee4aaf1..ae46345 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2158,6 +2158,8 @@ void intel_lr_context_resume(struct drm_i915_private *dev_priv)
>  			if (WARN_ON(IS_ERR(reg)))
>  				continue;
>  
> +			WARN_ON(!execlists_elsp_idle(engine));

I don't think this check here documents the assumption made when touching
the context. What you want is this assertion at suspend time.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 18:59 ` Chris Wilson
@ 2016-11-03 20:57   ` Imre Deak
  2016-11-03 21:14     ` Chris Wilson
  0 siblings, 1 reply; 12+ messages in thread
From: Imre Deak @ 2016-11-03 20:57 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Mika Kuoppala

On Thu, 2016-11-03 at 18:59 +0000, Chris Wilson wrote:
> On Thu, Nov 03, 2016 at 06:19:37PM +0200, Imre Deak wrote:
> > We assume that the GPU is idle once we receive the seqno via the last
> > request's user interrupt. In execlist mode, though, the corresponding
> > context completed interrupt can be delayed, and until this latter
> > interrupt arrives we consider the request to be pending on the ELSP
> > submit port. This can cause a problem during system suspend, where
> > this last request will be seen by the resume code as still pending.
> > Such pending requests are normally replayed after a GPU reset, but
> > during resume we reset both the SW and HW tracking of the ring
> > head/tail pointers, so replaying the pending request with its stale
> > tail pointer will leave the ring in an inconsistent state. A
> > subsequent request submission can then lead to the GPU executing from
> > an uninitialized area of the ring behind the above stale tail pointer.
> > 
> > Fix this by making sure any pending request on the ELSP port is
> > completed before suspending. I used a polling wait, since the
> > completion time I measured was <1ms and since normally we only need to
> > wait during system suspend. GPU idling during runtime suspend is
> > scheduled with a delay (currently 50-100ms) after the retirement of
> > the last request, at which point the context completed interrupt must
> > have arrived already.
> > 
> > The chance of this bug was increased by
> > 
> > commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> > Author: Imre Deak <imre.deak@intel.com>
> > Date:   Wed Oct 12 17:46:37 2016 +0300
> > 
> >     drm/i915/hsw: Fix GPU hang during resume from S3-devices state
> > 
> > but it could happen even without the explicit GPU reset, since we
> > disable interrupts afterwards during the suspend sequence.
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
> >  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
> >  drivers/gpu/drm/i915/intel_lrc.h |  1 +
> >  3 files changed, 16 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 1f995ce..5ff02b5 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> >  	if (dev_priv->gt.active_requests)
> >  		goto out_unlock;
> >  
> > +	if (i915.enable_execlists)
> > +		intel_lr_wait_engines_idle(dev_priv);
> 
> Idle work handler... So runtime suspend.
> Anyway this is not an ideal place for a stall under struct_mutex (even if
> 16x10us, it's the principle!).

During runtime suspend this won't add any overhead since the context
done interrupt happened already (unless there is a bug somewhere else).

> Move this to before the first READ_ONCE(dev_priv->gt.active_requests);
> so we stall before taking the lock, and skip if any new requests arrive
> whilst waiting.
> 
> (Also i915.enable_execlists is forbidden. But meh)
> 
> static struct drm_i915_gem_request *
> execlists_active_port(struct intel_engine_cs *engine)
> {
> 	struct drm_i915_gem_request *request;
> 
> 	request = READ_ONCE(engine->execlist_port[1].request);
> 	if (request)
> 		return request;
> 
> 	return READ_ONCE(engine->execlist_port[0].request);
> }
> 
> /* Wait for execlists to settle, but bail if any new requests come in */
> for_each_engine(engine, dev_priv, id) {
> 	struct drm_i915_gem_request *request;
> 
> 	request = execlists_active_port(engine);
> 	if (!request)
> 		continue;
> 
> 	if (wait_for(execlists_active_port(engine) != request, 10))
> 		DRM_ERROR("Timeout waiting for %s to idle\n", engine->name);
> }

Hm, but we still need to re-check and bail out if not idle with
struct_mutex held, since gt.active_requests could go 0->1->0 before
taking struct_mutex? I can rewrite things with that check added, using
the above.

--Imre

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 20:57   ` Imre Deak
@ 2016-11-03 21:14     ` Chris Wilson
  2016-11-04 20:33       ` Imre Deak
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2016-11-03 21:14 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx, Mika Kuoppala

On Thu, Nov 03, 2016 at 10:57:23PM +0200, Imre Deak wrote:
> On Thu, 2016-11-03 at 18:59 +0000, Chris Wilson wrote:
> > On Thu, Nov 03, 2016 at 06:19:37PM +0200, Imre Deak wrote:
> > > We assume that the GPU is idle once we receive the seqno via the last
> > > request's user interrupt. In execlist mode, though, the corresponding
> > > context completed interrupt can be delayed, and until this latter
> > > interrupt arrives we consider the request to be pending on the ELSP
> > > submit port. This can cause a problem during system suspend, where
> > > this last request will be seen by the resume code as still pending.
> > > Such pending requests are normally replayed after a GPU reset, but
> > > during resume we reset both the SW and HW tracking of the ring
> > > head/tail pointers, so replaying the pending request with its stale
> > > tail pointer will leave the ring in an inconsistent state. A
> > > subsequent request submission can then lead to the GPU executing from
> > > an uninitialized area of the ring behind the above stale tail pointer.
> > > 
> > > Fix this by making sure any pending request on the ELSP port is
> > > completed before suspending. I used a polling wait, since the
> > > completion time I measured was <1ms and since normally we only need to
> > > wait during system suspend. GPU idling during runtime suspend is
> > > scheduled with a delay (currently 50-100ms) after the retirement of
> > > the last request, at which point the context completed interrupt must
> > > have arrived already.
> > > 
> > > The chance of this bug was increased by
> > > 
> > > commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> > > Author: Imre Deak <imre.deak@intel.com>
> > > Date:   Wed Oct 12 17:46:37 2016 +0300
> > > 
> > >     drm/i915/hsw: Fix GPU hang during resume from S3-devices state
> > > 
> > > but it could happen even without the explicit GPU reset, since we
> > > disable interrupts afterwards during the suspend sequence.
> > > 
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
> > >  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
> > >  drivers/gpu/drm/i915/intel_lrc.h |  1 +
> > >  3 files changed, 16 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index 1f995ce..5ff02b5 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> > >  	if (dev_priv->gt.active_requests)
> > >  		goto out_unlock;
> > >  
> > > +	if (i915.enable_execlists)
> > > +		intel_lr_wait_engines_idle(dev_priv);
> > 
> > Idle work handler... So runtime suspend.
> > Anyway this is not an ideal place for a stall under struct_mutex (even if
> > 16x10us, it's the principle!).
> 
> During runtime suspend this won't add any overhead since the context
> done interrupt happened already (unless there is a bug somewhere else).

Where is that guaranteed? I thought we only serialised with the pm
interrupts. Remember this happens before rpm suspend, since
gem_idle_work_handler is responsible for dropping the GPU wakelock.
 
> > Move this to before the first READ_ONCE(dev_priv->gt.active_requests);
> > so we stall before taking the lock, and skip if any new requests arrive
> > whilst waiting.
> > 
> > (Also i915.enable_execlists is forbidden. But meh)
> > 
> > static struct drm_i915_gem_request *
> > execlists_active_port(struct intel_engine_cs *engine)
> > {
> > 	struct drm_i915_gem_request *request;
> > 
> > 	request = READ_ONCE(engine->execlist_port[1].request);
> > 	if (request)
> > 		return request;
> > 
> > 	return READ_ONCE(engine->execlist_port[0].request);
> > }
> > 
> > /* Wait for execlists to settle, but bail if any new requests come in */
> > for_each_engine(engine, dev_priv, id) {
> > 	struct drm_i915_gem_request *request;
> > 
> > 	request = execlists_active_port(engine);
> > 	if (!request)
> > 		continue;
> > 
> > 	if (wait_for(execlists_active_port(engine) != request, 10))
> > 		DRM_ERROR("Timeout waiting for %s to idle\n", engine->name);
> > }
> 
> Hm, but we still need to re-check and bail out if not idle with
> struct_mutex held, since gt.active_requests could go 0->1->0 before
> taking struct_mutex? I can rewrite things with that check added, using
> the above.

Hmm, apparently we don't care ;) If the context-done interrupt is
serialised with runtime suspend, then we don't need a wait here at all.
On the system suspend path there are no new requests and we are just
flushing the idle worker.

But yes, for the sake of correctness do both an unlocked wait followed
by a locked wait.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-03 21:14     ` Chris Wilson
@ 2016-11-04 20:33       ` Imre Deak
  2016-11-04 21:01         ` Chris Wilson
  0 siblings, 1 reply; 12+ messages in thread
From: Imre Deak @ 2016-11-04 20:33 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Mika Kuoppala

On Thu, 2016-11-03 at 21:14 +0000, Chris Wilson wrote:
> On Thu, Nov 03, 2016 at 10:57:23PM +0200, Imre Deak wrote:
> > On Thu, 2016-11-03 at 18:59 +0000, Chris Wilson wrote:
> > > On Thu, Nov 03, 2016 at 06:19:37PM +0200, Imre Deak wrote:
> > > > We assume that the GPU is idle once we receive the seqno via the last
> > > > request's user interrupt. In execlist mode, though, the corresponding
> > > > context completed interrupt can be delayed, and until this latter
> > > > interrupt arrives we consider the request to be pending on the ELSP
> > > > submit port. This can cause a problem during system suspend, where
> > > > this last request will be seen by the resume code as still pending.
> > > > Such pending requests are normally replayed after a GPU reset, but
> > > > during resume we reset both the SW and HW tracking of the ring
> > > > head/tail pointers, so replaying the pending request with its stale
> > > > tail pointer will leave the ring in an inconsistent state. A
> > > > subsequent request submission can then lead to the GPU executing from
> > > > an uninitialized area of the ring behind the above stale tail pointer.
> > > > 
> > > > Fix this by making sure any pending request on the ELSP port is
> > > > completed before suspending. I used a polling wait, since the
> > > > completion time I measured was <1ms and since normally we only need to
> > > > wait during system suspend. GPU idling during runtime suspend is
> > > > scheduled with a delay (currently 50-100ms) after the retirement of
> > > > the last request, at which point the context completed interrupt must
> > > > have arrived already.
> > > > 
> > > > The chance of this bug was increased by
> > > > 
> > > > commit 1c777c5d1dcdf8fa0223fcff35fb387b5bb9517a
> > > > Author: Imre Deak <imre.deak@intel.com>
> > > > Date:   Wed Oct 12 17:46:37 2016 +0300
> > > > 
> > > >     drm/i915/hsw: Fix GPU hang during resume from S3-devices state
> > > > 
> > > > but it could happen even without the explicit GPU reset, since we
> > > > disable interrupts afterwards during the suspend sequence.
> > > > 
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98470
> > > > Signed-off-by: Imre Deak <imre.deak@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem.c  |  3 +++
> > > >  drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++++++
> > > >  drivers/gpu/drm/i915/intel_lrc.h |  1 +
> > > >  3 files changed, 16 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > index 1f995ce..5ff02b5 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > @@ -2766,6 +2766,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> > > >  	if (dev_priv->gt.active_requests)
> > > >  		goto out_unlock;
> > > >  
> > > > +	if (i915.enable_execlists)
> > > > +		intel_lr_wait_engines_idle(dev_priv);
> > > 
> > > Idle work handler... So runtime suspend.
> > > Anyway this is not an ideal place for a stall under struct_mutex (even if
> > > 16x10us, it's the principle!).
> > 
> > During runtime suspend this won't add any overhead since the context
> > done interrupt happened already (unless there is a bug somewhere else).
> 
> Where is that guaranteed? I thought we only serialised with the pm
> interrupts. Remember this happens before rpm suspend, since
> gem_idle_work_handler is responsible for dropping the GPU wakelock.

I meant that the 100msec between the last request signaling completion
and this handler being scheduled is normally enough for the context
complete interrupt to get delivered. But yeah, it's not a guarantee.

> > > Move this to before the first READ_ONCE(dev_priv->gt.active_requests);
> > > so we stall before taking the lock, and skip if any new requests arrive
> > > whilst waiting.
> > > 
> > > (Also i915.enable_execlists is forbidden. But meh)
> > > 
> > > static struct drm_i915_gem_request *
> > > execlists_active_port(struct intel_engine_cs *engine)
> > > {
> > > 	struct drm_i915_gem_request *request;
> > > 
> > > 	request = READ_ONCE(engine->execlist_port[1].request);
> > > 	if (request)
> > > 		return request;
> > > 
> > > 	return READ_ONCE(engine->execlist_port[0].request);
> > > }
> > > 
> > > /* Wait for execlists to settle, but bail if any new requests come in */
> > > for_each_engine(engine, dev_priv, id) {
> > > 	struct drm_i915_gem_request *request;
> > > 
> > > 	request = execlists_active_port(engine);
> > > 	if (!request)
> > > 		continue;
> > > 
> > > 	if (wait_for(execlists_active_port(engine) != request, 10))
> > > 		DRM_ERROR("Timeout waiting for %s to idle\n", engine->name);
> > > }
> > 
> > Hm, but we still need to re-check and bail out if not idle with
> > struct_mutex held, since gt.active_requests could go 0->1->0 before
> > taking struct_mutex? I can rewrite things with that check added, using
> > the above.
> 
> Hmm, apparently we don't care ;) If the context-done interrupt is
> serialised with runtime suspend, then we don't need a wait here at all.
> On the system suspend path there are no new requests and we are just
> flushing the idle worker.
> 
> But yes, for the sake of correctness do both an unlocked wait followed
> by a locked wait.

Ok, will resend with this change. I used gt.active_requests to detect a
new request during the unlocked wait; that way we can avoid polling in
some cases.

--Imre

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-04 20:33       ` Imre Deak
@ 2016-11-04 21:01         ` Chris Wilson
  2016-11-04 22:32           ` Imre Deak
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2016-11-04 21:01 UTC (permalink / raw)
  To: Imre Deak; +Cc: intel-gfx, Mika Kuoppala

On Fri, Nov 04, 2016 at 10:33:24PM +0200, Imre Deak wrote:
> On Thu, 2016-11-03 at 21:14 +0000, Chris Wilson wrote:
> > Where is that guaranteed? I thought we only serialised with the pm
> > interrupts. Remember this happens before rpm suspend, since
> > gem_idle_work_handler is responsible for dropping the GPU wakelock.
> 
> I meant that the 100msec between the last request signaling completion
> and this handler being scheduled is normally enough for the context
> complete interrupt to get delivered. But yeah, it's not a guarantee.

If only it was that deterministic! The idle_worker was scheduled 100ms
after some retire_worker, just not necessarily the most recent. So it
could be running exactly as active_requests -> 0 and so before the
context-interrupt.

Anyway, it was a good find!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-04 21:01         ` Chris Wilson
@ 2016-11-04 22:32           ` Imre Deak
  2016-11-05  0:11             ` Imre Deak
  0 siblings, 1 reply; 12+ messages in thread
From: Imre Deak @ 2016-11-04 22:32 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Mika Kuoppala

On Fri, 2016-11-04 at 21:01 +0000, Chris Wilson wrote:
> On Fri, Nov 04, 2016 at 10:33:24PM +0200, Imre Deak wrote:
> > On Thu, 2016-11-03 at 21:14 +0000, Chris Wilson wrote:
> > > Where is that guaranteed? I thought we only serialised with the pm
> > > interrupts. Remember this happens before rpm suspend, since
> > > gem_idle_work_handler is responsible for dropping the GPU wakelock.
> > 
> > I meant that the 100msec between the last request signaling completion
> > and this handler being scheduled is normally enough for the context
> > complete interrupt to get delivered. But yeah, it's not a guarantee.
> 
> If only it was that deterministic! The idle_worker was scheduled 100ms
> after some retire_worker, just not necessarily the most recent. So it
> could be running exactly as active_requests -> 0 and so before the
> context-interrupt.

Right, but we don't poll in that case, so there is no overhead.

> Anyway, it was a good find!
> -Chris
> 

* Re: [PATCH 1/2] drm/i915: Make sure engines are idle during GPU idling in LR mode
  2016-11-04 22:32           ` Imre Deak
@ 2016-11-05  0:11             ` Imre Deak
  0 siblings, 0 replies; 12+ messages in thread
From: Imre Deak @ 2016-11-05  0:11 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Mika Kuoppala

On Sat, 2016-11-05 at 00:32 +0200, Imre Deak wrote:
> On Fri, 2016-11-04 at 21:01 +0000, Chris Wilson wrote:
> > On Fri, Nov 04, 2016 at 10:33:24PM +0200, Imre Deak wrote:
> > > On Thu, 2016-11-03 at 21:14 +0000, Chris Wilson wrote:
> > > > Where is that guaranteed? I thought we only serialised with the pm
> > > > interrupts. Remember this happens before rpm suspend, since
> > > > gem_idle_work_handler is responsible for dropping the GPU wakelock.
> > > 
> > > I meant that the 100msec between the last request signaling completion
> > > and this handler being scheduled is normally enough for the context
> > > complete interrupt to get delivered. But yeah, it's not a guarantee.
> > 
> > If only it was that deterministic! The idle_worker was scheduled 100ms
> > after some retire_worker, just not necessarily the most recent. So it
> > could be running exactly as active_requests -> 0 and so before the
> > context-interrupt.
> 
> Right, but we don't poll in that case, so there is no overhead.

Ok, there is a small window in the idle_worker after the unlocked poll
and before taking the lock where a new request could be submitted and
retired. In that case active_requests could be 0 after taking the lock
and we'd have the poll overhead there.

We could detect this by the fact that there is a new idle_worker
pending and bail out in that case. We shouldn't idle the GPU in that
case anyway.
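
Roughly (a sketch; assuming the check runs right after taking
struct_mutex in the idle worker):

	/* A new request was submitted and retired since our unlocked
	 * poll: another idle worker invocation is already queued, so
	 * leave the idling (and any polling) to that one.
	 */
	if (delayed_work_pending(&dev_priv->gt.idle_work))
		goto out_unlock;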

> > Anyway, it was a good find!
> > -Chris
> > 
