All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/i915: Reorder media/render reset on g4x
@ 2017-05-18 12:34 Mika Kuoppala
  2017-05-18 12:34 ` [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset Mika Kuoppala
  0 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-18 12:34 UTC (permalink / raw)
  To: intel-gfx

From: Chris Wilson <chris@chris-wilson.co.uk>

Ville found a reference to WaMediaResetBeforeFullReset which we presume
means that we should simply do the media reset first.

References: https://bugs.freedesktop.org/show_bug.cgi?id=100942
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index a9a6933afda2..7eaaf2225e1a 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1468,12 +1468,6 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 	struct pci_dev *pdev = dev_priv->drm.pdev;
 	int ret;
 
-	pci_write_config_byte(pdev, I915_GDRST,
-			      GRDOM_RENDER | GRDOM_RESET_ENABLE);
-	ret =  wait_for(g4x_reset_complete(pdev), 500);
-	if (ret)
-		goto out;
-
 	/* WaVcpClkGateDisableForMediaReset:ctg,elk */
 	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
 	POSTING_READ(VDECCLK_GATE_D);
@@ -1481,11 +1475,17 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 	pci_write_config_byte(pdev, I915_GDRST,
 			      GRDOM_MEDIA | GRDOM_RESET_ENABLE);
 	ret =  wait_for(g4x_reset_complete(pdev), 500);
+	if (ret)
+		goto out;
 
 	/* WaVcpClkGateDisableForMediaReset:ctg,elk */
 	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) & ~VCP_UNIT_CLOCK_GATE_DISABLE);
 	POSTING_READ(VDECCLK_GATE_D);
 
+	pci_write_config_byte(pdev, I915_GDRST,
+			      GRDOM_RENDER | GRDOM_RESET_ENABLE);
+	ret =  wait_for(g4x_reset_complete(pdev), 500);
+
 out:
 	pci_write_config_byte(pdev, I915_GDRST, 0);
 	return ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset
  2017-05-18 12:34 [PATCH 1/2] drm/i915: Reorder media/render reset on g4x Mika Kuoppala
@ 2017-05-18 12:34 ` Mika Kuoppala
  2017-05-18 12:40   ` Mika Kuoppala
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-18 12:34 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomi Sarvela

ELK seems to very picky about the preconditions to reset.
Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
not like if reset occurs when there is active ring.

Ville found out that there is workaround with name
'WaMediaResetMainRingCleanup' which suggests that we need to
cleanup rings before resetting. It is unclear what cleanup
exactly means but evidence shows that stopping the ring
does have an effect on reset reliability.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100998
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 7eaaf2225e1a..1d473cd1f8a4 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1427,6 +1427,26 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 	return ret;
 }
 
+static void gen3_stop_rings(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, dev_priv, id) {
+		const i915_reg_t mode_r = RING_MI_MODE(engine->mmio_base);
+
+		I915_WRITE_FW(mode_r, _MASKED_BIT_ENABLE(STOP_RING));
+		if (intel_wait_for_register_fw(dev_priv,
+					       mode_r,
+					       MODE_IDLE,
+					       MODE_IDLE,
+					       1000)) {
+			DRM_DEBUG("%s : timed out STOP_RING\n",
+				  engine->name);
+		}
+	}
+}
+
 static bool i915_reset_complete(struct pci_dev *pdev)
 {
 	u8 gdrst;
@@ -1472,6 +1492,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
 	POSTING_READ(VDECCLK_GATE_D);
 
+	/* We stop engines, otherwise we might get failed reset and a
+	 * dead gpu (on elk).
+	 */
+	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
+	gen3_stop_rings(dev_priv);
+
 	pci_write_config_byte(pdev, I915_GDRST,
 			      GRDOM_MEDIA | GRDOM_RESET_ENABLE);
 	ret =  wait_for(g4x_reset_complete(pdev), 500);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset
  2017-05-18 12:34 ` [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset Mika Kuoppala
@ 2017-05-18 12:40   ` Mika Kuoppala
  2017-05-18 12:44   ` Chris Wilson
  2017-05-18 14:28   ` [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability Mika Kuoppala
  2 siblings, 0 replies; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-18 12:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomi Sarvela

Mika Kuoppala <mika.kuoppala@linux.intel.com> writes:

> ELK seems to very picky about the preconditions to reset.
> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
> not like if reset occurs when there is active ring.
>
> Ville found out that there is workaround with name
> 'WaMediaResetMainRingCleanup' which suggests that we need to
> cleanup rings before resetting. It is unclear what cleanup
> exactly means but evidence shows that stopping the ring
> does have an effect on reset reliability.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100998

This is wrong. Should be
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942

> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 7eaaf2225e1a..1d473cd1f8a4 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1427,6 +1427,26 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>  	return ret;
>  }
>  
> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		const i915_reg_t mode_r = RING_MI_MODE(engine->mmio_base);
> +
> +		I915_WRITE_FW(mode_r, _MASKED_BIT_ENABLE(STOP_RING));
> +		if (intel_wait_for_register_fw(dev_priv,
> +					       mode_r,
> +					       MODE_IDLE,
> +					       MODE_IDLE,
> +					       1000)) {
> +			DRM_DEBUG("%s : timed out STOP_RING\n",
> +				  engine->name);
> +		}
> +	}
> +}
> +
>  static bool i915_reset_complete(struct pci_dev *pdev)
>  {
>  	u8 gdrst;
> @@ -1472,6 +1492,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>  	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
>  	POSTING_READ(VDECCLK_GATE_D);
>  
> +	/* We stop engines, otherwise we might get failed reset and a
> +	 * dead gpu (on elk).
> +	 */
> +	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
> +	gen3_stop_rings(dev_priv);
> +
>  	pci_write_config_byte(pdev, I915_GDRST,
>  			      GRDOM_MEDIA | GRDOM_RESET_ENABLE);
>  	ret =  wait_for(g4x_reset_complete(pdev), 500);
> -- 
> 2.11.0
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset
  2017-05-18 12:34 ` [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset Mika Kuoppala
  2017-05-18 12:40   ` Mika Kuoppala
@ 2017-05-18 12:44   ` Chris Wilson
  2017-05-18 12:51     ` Mika Kuoppala
  2017-05-18 14:28   ` [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability Mika Kuoppala
  2 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2017-05-18 12:44 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Tomi Sarvela, intel-gfx

On Thu, May 18, 2017 at 03:34:22PM +0300, Mika Kuoppala wrote:
> ELK seems to very picky about the preconditions to reset.
> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
> not like if reset occurs when there is active ring.
> 
> Ville found out that there is workaround with name
> 'WaMediaResetMainRingCleanup' which suggests that we need to
> cleanup rings before resetting. It is unclear what cleanup
> exactly means but evidence shows that stopping the ring
> does have an effect on reset reliability.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100998
> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 7eaaf2225e1a..1d473cd1f8a4 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1427,6 +1427,26 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>  	return ret;
>  }
>  
> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		const i915_reg_t mode_r = RING_MI_MODE(engine->mmio_base);
> +
> +		I915_WRITE_FW(mode_r, _MASKED_BIT_ENABLE(STOP_RING));

This will only stop between instructions on the CS (i.e. within the
ring) aiui . If we did hang in a shader, this is not going to achieve
very much.

> +		if (intel_wait_for_register_fw(dev_priv,
> +					       mode_r,
> +					       MODE_IDLE,
> +					       MODE_IDLE,
> +					       1000)) {
> +			DRM_DEBUG("%s : timed out STOP_RING\n",
> +				  engine->name);

If we make this a DRM_ERROR I expect it to fire through the hang tests
in BAT.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset
  2017-05-18 12:44   ` Chris Wilson
@ 2017-05-18 12:51     ` Mika Kuoppala
  2017-05-18 12:58       ` Chris Wilson
  0 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-18 12:51 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Tomi Sarvela, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Thu, May 18, 2017 at 03:34:22PM +0300, Mika Kuoppala wrote:
>> ELK seems to very picky about the preconditions to reset.
>> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
>> not like if reset occurs when there is active ring.
>> 
>> Ville found out that there is workaround with name
>> 'WaMediaResetMainRingCleanup' which suggests that we need to
>> cleanup rings before resetting. It is unclear what cleanup
>> exactly means but evidence shows that stopping the ring
>> does have an effect on reset reliability.
>> 
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100998
>> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_uncore.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 7eaaf2225e1a..1d473cd1f8a4 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1427,6 +1427,26 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>  	return ret;
>>  }
>>  
>> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
>> +{
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine(engine, dev_priv, id) {
>> +		const i915_reg_t mode_r = RING_MI_MODE(engine->mmio_base);
>> +
>> +		I915_WRITE_FW(mode_r, _MASKED_BIT_ENABLE(STOP_RING));
>
> This will only stop between instructions on the CS (i.e. within the
> ring) aiui . If we did hang in a shader, this is not going to achieve
> very much.

The subject is too grandiose then. Should I add to the subject 'basic
hangs'?

This seems to help get our hang tests through, bringing more coverage.
I pondered adding ctl, head and tail writes to zero. It would not help
with shades, but should not hurt either.

>
>> +		if (intel_wait_for_register_fw(dev_priv,
>> +					       mode_r,
>> +					       MODE_IDLE,
>> +					       MODE_IDLE,
>> +					       1000)) {
>> +			DRM_DEBUG("%s : timed out STOP_RING\n",
>> +				  engine->name);
>
> If we make this a DRM_ERROR I expect it to fire through the hang tests
> in BAT.

Why would stopping the ring be a problem with a clean chained batch?
Well I try to see how it goes.

-Mika


> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset
  2017-05-18 12:51     ` Mika Kuoppala
@ 2017-05-18 12:58       ` Chris Wilson
  0 siblings, 0 replies; 9+ messages in thread
From: Chris Wilson @ 2017-05-18 12:58 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Tomi Sarvela, intel-gfx

On Thu, May 18, 2017 at 03:51:19PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Thu, May 18, 2017 at 03:34:22PM +0300, Mika Kuoppala wrote:
> >> ELK seems to very picky about the preconditions to reset.
> >> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
> >> not like if reset occurs when there is active ring.
> >> 
> >> Ville found out that there is workaround with name
> >> 'WaMediaResetMainRingCleanup' which suggests that we need to
> >> cleanup rings before resetting. It is unclear what cleanup
> >> exactly means but evidence shows that stopping the ring
> >> does have an effect on reset reliability.
> >> 
> >> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100998
> >> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/intel_uncore.c | 26 ++++++++++++++++++++++++++
> >>  1 file changed, 26 insertions(+)
> >> 
> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> >> index 7eaaf2225e1a..1d473cd1f8a4 100644
> >> --- a/drivers/gpu/drm/i915/intel_uncore.c
> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> >> @@ -1427,6 +1427,26 @@ int i915_reg_read_ioctl(struct drm_device *dev,
> >>  	return ret;
> >>  }
> >>  
> >> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
> >> +{
> >> +	struct intel_engine_cs *engine;
> >> +	enum intel_engine_id id;
> >> +
> >> +	for_each_engine(engine, dev_priv, id) {
> >> +		const i915_reg_t mode_r = RING_MI_MODE(engine->mmio_base);
> >> +
> >> +		I915_WRITE_FW(mode_r, _MASKED_BIT_ENABLE(STOP_RING));
> >
> > This will only stop between instructions on the CS (i.e. within the
> > ring) aiui . If we did hang in a shader, this is not going to achieve
> > very much.
> 
> The subject is too grandiose then. Should I add to the subject 'basic
> hangs'?
> 
> This seems to help get our hang tests through, bringing more coverage.
> I pondered adding ctl, head and tail writes to zero. It would not help
> with shades, but should not hurt either.
> 
> >
> >> +		if (intel_wait_for_register_fw(dev_priv,
> >> +					       mode_r,
> >> +					       MODE_IDLE,
> >> +					       MODE_IDLE,
> >> +					       1000)) {
> >> +			DRM_DEBUG("%s : timed out STOP_RING\n",
> >> +				  engine->name);
> >
> > If we make this a DRM_ERROR I expect it to fire through the hang tests
> > in BAT.
> 
> Why would stopping the ring be a problem with a clean chained batch?
> Well I try to see how it goes.

My understanding is that the STOP_RING only takes affect on the ring
itself, our chained batches stay away from the ring. At least that's the
problem I encountered when trying to use STOP_RING + SYNC_FLUSH for
seqno flushing.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability
  2017-05-18 12:34 ` [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset Mika Kuoppala
  2017-05-18 12:40   ` Mika Kuoppala
  2017-05-18 12:44   ` Chris Wilson
@ 2017-05-18 14:28   ` Mika Kuoppala
  2017-05-18 22:03     ` Chris Wilson
  2 siblings, 1 reply; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-18 14:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomi Sarvela

ELK seems to very picky about the preconditions to reset.
Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
not like if reset occurs when there is active ring.

Ville found out that there is workaround with name
'WaMediaResetMainRingCleanup' which suggests that we need to
cleanup rings before resetting. It is unclear what cleanup
exactly means but evidence shows that stopping the ring
does have an effect on reset reliability. This patch makes
reset succesful on hangs induced by chained batches (the igt ones).
Note that if the hang is inside a shader, it is possible
that our attempts to stop the ring achieves anything.

v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
Testcase: igt/gem_busy/*-hang
Testcase: igt/gem_ringfill/hang-*
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 7eaaf2225e1a..43da84be0321 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1427,6 +1427,35 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 	return ret;
 }
 
+static void gen3_stop_rings(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, dev_priv, id) {
+		const u32 base = engine->mmio_base;
+		const i915_reg_t mode = RING_MI_MODE(base);
+
+		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
+		if (intel_wait_for_register_fw(dev_priv,
+					       mode,
+					       MODE_IDLE,
+					       MODE_IDLE,
+					       500))
+			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
+					 engine->name);
+
+		I915_WRITE_FW(RING_CTL(base), 0);
+		I915_WRITE_FW(RING_HEAD(base), 0);
+		I915_WRITE_FW(RING_TAIL(base), 0);
+
+		/* Check acts as a post */
+		if (I915_READ_FW(RING_HEAD(base)) != 0)
+			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
+					 engine->name);
+	}
+}
+
 static bool i915_reset_complete(struct pci_dev *pdev)
 {
 	u8 gdrst;
@@ -1472,6 +1501,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
 	POSTING_READ(VDECCLK_GATE_D);
 
+	/* We stop engines, otherwise we might get failed reset and a
+	 * dead gpu (on elk).
+	 */
+	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
+	gen3_stop_rings(dev_priv);
+
 	pci_write_config_byte(pdev, I915_GDRST,
 			      GRDOM_MEDIA | GRDOM_RESET_ENABLE);
 	ret =  wait_for(g4x_reset_complete(pdev), 500);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability
  2017-05-18 14:28   ` [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability Mika Kuoppala
@ 2017-05-18 22:03     ` Chris Wilson
  2017-05-19 11:22       ` Mika Kuoppala
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2017-05-18 22:03 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Tomi Sarvela, intel-gfx

On Thu, May 18, 2017 at 05:28:41PM +0300, Mika Kuoppala wrote:
> ELK seems to very picky about the preconditions to reset.
> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
> not like if reset occurs when there is active ring.
> 
> Ville found out that there is workaround with name
> 'WaMediaResetMainRingCleanup' which suggests that we need to
> cleanup rings before resetting. It is unclear what cleanup
> exactly means but evidence shows that stopping the ring
> does have an effect on reset reliability. This patch makes
> reset succesful on hangs induced by chained batches (the igt ones).
> Note that if the hang is inside a shader, it is possible
> that our attempts to stop the ring achieves anything.
> 
> v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
> Testcase: igt/gem_busy/*-hang
> Testcase: igt/gem_ringfill/hang-*

Maybe add # elk to these to indicate the problem isn't quite that
widespread!

> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 7eaaf2225e1a..43da84be0321 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1427,6 +1427,35 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>  	return ret;
>  }
>  
> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		const u32 base = engine->mmio_base;
> +		const i915_reg_t mode = RING_MI_MODE(base);
> +
> +		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
> +		if (intel_wait_for_register_fw(dev_priv,
> +					       mode,
> +					       MODE_IDLE,
> +					       MODE_IDLE,
> +					       500))
> +			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
> +					 engine->name);
> +
> +		I915_WRITE_FW(RING_CTL(base), 0);
> +		I915_WRITE_FW(RING_HEAD(base), 0);
> +		I915_WRITE_FW(RING_TAIL(base), 0);
> +
> +		/* Check acts as a post */
> +		if (I915_READ_FW(RING_HEAD(base)) != 0)
> +			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
> +					 engine->name);
> +	}
> +}
> +
>  static bool i915_reset_complete(struct pci_dev *pdev)
>  {
>  	u8 gdrst;
> @@ -1472,6 +1501,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>  	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
>  	POSTING_READ(VDECCLK_GATE_D);
>  
> +	/* We stop engines, otherwise we might get failed reset and a
> +	 * dead gpu (on elk).
> +	 */
> +	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */

Join this into a single comment block, s/supposedly/presumably/

Just a small concern we have some duplication of stop_ring() here, but I
don't have a better suggestion (along the lines of export intel_stop_ring,
gen3_engine_stop_ring, so far looks more confusing than helpful). As you
have tested with DRM_ERROR to be sure that fear about this simply
timing out for our spinning batches, it looks good to me.

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability
  2017-05-18 22:03     ` Chris Wilson
@ 2017-05-19 11:22       ` Mika Kuoppala
  0 siblings, 0 replies; 9+ messages in thread
From: Mika Kuoppala @ 2017-05-19 11:22 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Tomi Sarvela, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Thu, May 18, 2017 at 05:28:41PM +0300, Mika Kuoppala wrote:
>> ELK seems to very picky about the preconditions to reset.
>> Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
>> not like if reset occurs when there is active ring.
>> 
>> Ville found out that there is workaround with name
>> 'WaMediaResetMainRingCleanup' which suggests that we need to
>> cleanup rings before resetting. It is unclear what cleanup
>> exactly means but evidence shows that stopping the ring
>> does have an effect on reset reliability. This patch makes
>> reset succesful on hangs induced by chained batches (the igt ones).
>> Note that if the hang is inside a shader, it is possible
>> that our attempts to stop the ring achieves anything.
>> 
>> v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)
>> 
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
>> Testcase: igt/gem_busy/*-hang
>> Testcase: igt/gem_ringfill/hang-*
>
> Maybe add # elk to these to indicate the problem isn't quite that
> widespread!
>
>> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_uncore.c | 35 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 35 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 7eaaf2225e1a..43da84be0321 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1427,6 +1427,35 @@ int i915_reg_read_ioctl(struct drm_device *dev,
>>  	return ret;
>>  }
>>  
>> +static void gen3_stop_rings(struct drm_i915_private *dev_priv)
>> +{
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +
>> +	for_each_engine(engine, dev_priv, id) {
>> +		const u32 base = engine->mmio_base;
>> +		const i915_reg_t mode = RING_MI_MODE(base);
>> +
>> +		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
>> +		if (intel_wait_for_register_fw(dev_priv,
>> +					       mode,
>> +					       MODE_IDLE,
>> +					       MODE_IDLE,
>> +					       500))
>> +			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
>> +					 engine->name);
>> +
>> +		I915_WRITE_FW(RING_CTL(base), 0);
>> +		I915_WRITE_FW(RING_HEAD(base), 0);
>> +		I915_WRITE_FW(RING_TAIL(base), 0);
>> +
>> +		/* Check acts as a post */
>> +		if (I915_READ_FW(RING_HEAD(base)) != 0)
>> +			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>> +					 engine->name);
>> +	}
>> +}
>> +
>>  static bool i915_reset_complete(struct pci_dev *pdev)
>>  {
>>  	u8 gdrst;
>> @@ -1472,6 +1501,12 @@ static int g4x_do_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>>  	I915_WRITE(VDECCLK_GATE_D, I915_READ(VDECCLK_GATE_D) | VCP_UNIT_CLOCK_GATE_DISABLE);
>>  	POSTING_READ(VDECCLK_GATE_D);
>>  
>> +	/* We stop engines, otherwise we might get failed reset and a
>> +	 * dead gpu (on elk).
>> +	 */
>> +	/* WaMediaResetMainRingCleanup:ctg,elk (supposedly) */
>
> Join this into a single comment block, s/supposedly/presumably/
>
> Just a small concern we have some duplication of stop_ring() here, but I
> don't have a better suggestion (along the lines of export intel_stop_ring,
> gen3_engine_stop_ring, so far looks more confusing than helpful). As
> you

I had a patch which piggypacked engine->reset_hw(engine, NULL) to do
the dirty work of stopping the ring. But the stop_ring of()
intel_ringbuffer.c was giving up halfway if it didn't find
idling the ring succesful, leaving head/tail intact.
And that was on the prepare reset path. The boon was that
it stopped the rings before killing the tasklet.

But I decided to do more surgical approach directy on top of
reset. If we find another gen which is suspectible, we might
want to either piggypack on reset_hw or do a engine->stop_ring()
and use it in prepare for reset path.

> have tested with DRM_ERROR to be sure that fear about this simply
> timing out for our spinning batches, it looks good to me.
>

Spinning batches in general seem to go idle nice, but gem_ringfill
will spew out that ring_stop timeout debug.

> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Thanks for review. Patch pushed.
-Mika
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-05-19 11:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-18 12:34 [PATCH 1/2] drm/i915: Reorder media/render reset on g4x Mika Kuoppala
2017-05-18 12:34 ` [PATCH 2/2] drm/i915/g4x: Fix unreliable gpu reset Mika Kuoppala
2017-05-18 12:40   ` Mika Kuoppala
2017-05-18 12:44   ` Chris Wilson
2017-05-18 12:51     ` Mika Kuoppala
2017-05-18 12:58       ` Chris Wilson
2017-05-18 14:28   ` [PATCH v2 2/2] drm/i915/g4x: Improve gpu reset reliability Mika Kuoppala
2017-05-18 22:03     ` Chris Wilson
2017-05-19 11:22       ` Mika Kuoppala

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.