* [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2
From: Mika Kuoppala @ 2017-02-15 12:37 UTC
  To: intel-gfx
  Cc: Mika Kuoppala, Chris Wilson, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Certain Baytrails, namely the 4 cpu core variants, have been
plagued by spurious system hangs, mostly occurring with light loads.

Multiple bisects by various people point to a commit which changes the
reclocking strategy for Baytrail to follow its bigger brethren:
commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")

There is also a review comment attached to this commit from Deepak S
on avoiding punit access on Cherryview, and thus it was excluded from
the common reclocking path. By taking the same approach and omitting
the punit access by not tweaking the thresholds when the hardware
has been asked to move to a different frequency, considerable gains
in stability have been observed.

With a J1900 box, a light render/video load would usually end up in a
system hang in less than 12 hours. With this patch applied, the cumulative
uptime has now been 34 days without issues. To provoke the system hang,
light loads on both render and bsd engines in parallel have been used:
glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4

So far, the author has not witnessed a system hang with the above load
and this patch applied. Reports from the tenacious people at
kernel bugzilla are also promising.

Considering that the punit access frequency with this patch is
considerably lower, there is a possibility that this will push
the, still unknown, root cause past the triggering point on most loads.

But as we now can reliably reproduce the hang independently,
we can reduce the pain that users are having and use
static thresholds until a root cause is found.

References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: fritsch@xbmc.org
Cc: miku@iki.fi
Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
CC: Michal Feix <michal@feix.cz>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Deepak S <deepak.s@linux.intel.com>
Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Cc: <stable@vger.kernel.org> # v4.2+
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 4 ++--
 drivers/gpu/drm/i915/i915_reg.h | 2 ++
 drivers/gpu/drm/i915/intel_pm.c | 6 +++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a887aef..319c02d 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1095,7 +1095,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
 	if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
 		if (!vlv_c0_above(dev_priv,
 				  &dev_priv->rps.down_ei, &now,
-				  dev_priv->rps.down_threshold))
+				  VLV_RP_DOWN_EI_THRESHOLD))
 			events |= GEN6_PM_RP_DOWN_THRESHOLD;
 		dev_priv->rps.down_ei = now;
 	}
@@ -1103,7 +1103,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
 	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
 		if (vlv_c0_above(dev_priv,
 				 &dev_priv->rps.up_ei, &now,
-				 dev_priv->rps.up_threshold))
+				 VLV_RP_UP_EI_THRESHOLD))
 			events |= GEN6_PM_RP_UP_THRESHOLD;
 		dev_priv->rps.up_ei = now;
 	}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 141a5c1..1297f6a 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1135,6 +1135,8 @@ enum skl_disp_power_wells {
 #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
 
 #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
+#define VLV_RP_UP_EI_THRESHOLD			90
+#define VLV_RP_DOWN_EI_THRESHOLD		70
 
 /* vlv2 north clock has */
 #define CCK_FUSE_REG				0x8
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 3d311e1..bce6aae 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4971,7 +4971,11 @@ static int valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
 		if (err)
 			return err;
 
-		gen6_set_rps_thresholds(dev_priv, val);
+		/* When byt can survive without system hang with dynamic
+		 * sw freq adjustments, this restriction can be lifted.
+		 */
+		if (!IS_VALLEYVIEW(dev_priv))
+			gen6_set_rps_thresholds(dev_priv, val);
 	}
 
 	dev_priv->rps.cur_freq = val;
-- 
2.7.4
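
The i915_irq.c hunks above swap the per-frequency thresholds for the fixed
90/70 values added to i915_reg.h. A minimal userspace sketch of that kind of
comparison follows, assuming the threshold is a busy percentage over an
evaluation interval. It is an illustration only, not the driver code:
struct ei_sample, busy_above(), evaluate() and the EV_* bits are hypothetical
names, and one interval is used for both directions, whereas the driver keeps
separate up and down intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the i915_reg.h hunk of this patch. */
#define VLV_RP_UP_EI_THRESHOLD   90
#define VLV_RP_DOWN_EI_THRESHOLD 70

/* Hypothetical event bits; the driver uses GEN6_PM_RP_* masks instead. */
#define EV_UP_THRESHOLD   (1u << 0)
#define EV_DOWN_THRESHOLD (1u << 1)

/* Hypothetical, simplified sample: one busy counter and one clock. */
struct ei_sample {
	uint64_t busy_ticks;
	uint64_t total_ticks;
};

/* True if busy time over [old, now] is at least threshold_pct percent. */
static bool busy_above(const struct ei_sample *old,
		       const struct ei_sample *now,
		       unsigned int threshold_pct)
{
	uint64_t busy = now->busy_ticks - old->busy_ticks;
	uint64_t total = now->total_ticks - old->total_ticks;

	assert(threshold_pct <= 100);
	if (total == 0)
		return false;	/* no interval has elapsed yet */

	/* busy / total >= pct / 100, rearranged to avoid division. */
	return busy * 100 >= total * threshold_pct;
}

/* Evaluate one interval against the static up/down thresholds. */
static unsigned int evaluate(const struct ei_sample *old,
			     const struct ei_sample *now)
{
	unsigned int events = 0;

	if (busy_above(old, now, VLV_RP_UP_EI_THRESHOLD))
		events |= EV_UP_THRESHOLD;
	if (!busy_above(old, now, VLV_RP_DOWN_EI_THRESHOLD))
		events |= EV_DOWN_THRESHOLD;

	return events;
}
```

With these fixed values, an interval that is 95% busy raises the up event,
one that is 50% busy raises the down event, and anything from 70% up to but
not including 90% raises neither.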

* Re: [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2
From: Chris Wilson @ 2017-02-15 12:52 UTC
  To: Mika Kuoppala
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

On Wed, Feb 15, 2017 at 02:37:50PM +0200, Mika Kuoppala wrote:
> Certain Baytrails, namely the 4 cpu core variants, have been
> plagued by spurious system hangs, mostly occurring with light loads.
> 
> Multiple bisects by various people point to a commit which changes the
> reclocking strategy for Baytrail to follow its bigger brethren:
> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> 
> There is also a review comment attached to this commit from Deepak S
> on avoiding punit access on Cherryview and thus it was excluded on
> common reclocking path. By taking the same approach and omitting
> the punit access by not tweaking the thresholds when the hardware
> has been asked to move into different frequency, considerable gains
> in stability have been observed.
> 
> With J1900 box, light render/video load would end up in system hang
> in usually less than 12 hours. With this patch applied, the cumulative
> uptime has now been 34 days without issues. To provoke system hang,
> light loads on both render and bsd engines in parallel have been used:
> glxgears >/dev/null 2>/dev/null &
> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> 
> So far, author has not witnessed system hang with above load
> and this patch applied. Reports from the tenacious people at
> kernel bugzilla are also promising.
> 
> Considering that the punit access frequency with this patch is
> considerably less, there is a possibility that this will push
> the, still unknown, root cause past the triggering point on most loads.
> 
> But as we now can reliably reproduce the hang independently,
> we can reduce the pain that users are having and use a
> static thresholds until a root cause is found.
> 
> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: fritsch@xbmc.org
> Cc: miku@iki.fi
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> CC: Michal Feix <michal@feix.cz>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v4.2+
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 4 ++--
>  drivers/gpu/drm/i915/i915_reg.h | 2 ++
>  drivers/gpu/drm/i915/intel_pm.c | 6 +++++-
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index a887aef..319c02d 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1095,7 +1095,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>  	if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
>  		if (!vlv_c0_above(dev_priv,
>  				  &dev_priv->rps.down_ei, &now,
> -				  dev_priv->rps.down_threshold))
> +				  VLV_RP_DOWN_EI_THRESHOLD))
>  			events |= GEN6_PM_RP_DOWN_THRESHOLD;
>  		dev_priv->rps.down_ei = now;
>  	}
> @@ -1103,7 +1103,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>  	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
>  		if (vlv_c0_above(dev_priv,
>  				 &dev_priv->rps.up_ei, &now,
> -				 dev_priv->rps.up_threshold))
> +				 VLV_RP_UP_EI_THRESHOLD))

A patch to set them as we set the default values during rps enable so
that you don't break the debug interfaces.

>  			events |= GEN6_PM_RP_UP_THRESHOLD;
>  		dev_priv->rps.up_ei = now;
>  	}
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 141a5c1..1297f6a 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1135,6 +1135,8 @@ enum skl_disp_power_wells {
>  #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
>  
>  #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
> +#define VLV_RP_UP_EI_THRESHOLD			90
> +#define VLV_RP_DOWN_EI_THRESHOLD		70
>  
>  /* vlv2 north clock has */
>  #define CCK_FUSE_REG				0x8
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 3d311e1..bce6aae 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -4971,7 +4971,11 @@ static int valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
>  		if (err)
>  			return err;
>  
> -		gen6_set_rps_thresholds(dev_priv, val);
> +		/* When byt can survive without system hang with dynamic
> +		 * sw freq adjustments, this restriction can be lifted.
> +		 */
> +		if (!IS_VALLEYVIEW(dev_priv))

Are all vlv affected?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2
From: Mika Kuoppala @ 2017-02-15 13:51 UTC
  To: Chris Wilson
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Wed, Feb 15, 2017 at 02:37:50PM +0200, Mika Kuoppala wrote:
>> Certain Baytrails, namely the 4 cpu core variants, have been
>> plagued by spurious system hangs, mostly occurring with light loads.
>> 
>> Multiple bisects by various people point to a commit which changes the
>> reclocking strategy for Baytrail to follow its bigger brethren:
>> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
>> 
>> There is also a review comment attached to this commit from Deepak S
>> on avoiding punit access on Cherryview and thus it was excluded on
>> common reclocking path. By taking the same approach and omitting
>> the punit access by not tweaking the thresholds when the hardware
>> has been asked to move into different frequency, considerable gains
>> in stability have been observed.
>> 
>> With J1900 box, light render/video load would end up in system hang
>> in usually less than 12 hours. With this patch applied, the cumulative
>> uptime has now been 34 days without issues. To provoke system hang,
>> light loads on both render and bsd engines in parallel have been used:
>> glxgears >/dev/null 2>/dev/null &
>> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
>> 
>> So far, author has not witnessed system hang with above load
>> and this patch applied. Reports from the tenacious people at
>> kernel bugzilla are also promising.
>> 
>> Considering that the punit access frequency with this patch is
>> considerably less, there is a possibility that this will push
>> the, still unknown, root cause past the triggering point on most loads.
>> 
>> But as we now can reliably reproduce the hang independently,
>> we can reduce the pain that users are having and use a
>> static thresholds until a root cause is found.
>> 
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Jani Nikula <jani.nikula@intel.com>
>> Cc: fritsch@xbmc.org
>> Cc: miku@iki.fi
>> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
>> CC: Michal Feix <michal@feix.cz>
>> Cc: Hans de Goede <hdegoede@redhat.com>
>> Cc: Deepak S <deepak.s@linux.intel.com>
>> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
>> Cc: <stable@vger.kernel.org> # v4.2+
>> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_irq.c | 4 ++--
>>  drivers/gpu/drm/i915/i915_reg.h | 2 ++
>>  drivers/gpu/drm/i915/intel_pm.c | 6 +++++-
>>  3 files changed, 9 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index a887aef..319c02d 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -1095,7 +1095,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>>  	if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
>>  		if (!vlv_c0_above(dev_priv,
>>  				  &dev_priv->rps.down_ei, &now,
>> -				  dev_priv->rps.down_threshold))
>> +				  VLV_RP_DOWN_EI_THRESHOLD))
>>  			events |= GEN6_PM_RP_DOWN_THRESHOLD;
>>  		dev_priv->rps.down_ei = now;
>>  	}
>> @@ -1103,7 +1103,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>>  	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
>>  		if (vlv_c0_above(dev_priv,
>>  				 &dev_priv->rps.up_ei, &now,
>> -				 dev_priv->rps.up_threshold))
>> +				 VLV_RP_UP_EI_THRESHOLD))
>
> A patch to set them as we set the default values during rps enable so
> that you don't break the debug interfaces.
>
>>  			events |= GEN6_PM_RP_UP_THRESHOLD;
>>  		dev_priv->rps.up_ei = now;
>>  	}
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index 141a5c1..1297f6a 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -1135,6 +1135,8 @@ enum skl_disp_power_wells {
>>  #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
>>  
>>  #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
>> +#define VLV_RP_UP_EI_THRESHOLD			90
>> +#define VLV_RP_DOWN_EI_THRESHOLD		70
>>  
>>  /* vlv2 north clock has */
>>  #define CCK_FUSE_REG				0x8
>> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
>> index 3d311e1..bce6aae 100644
>> --- a/drivers/gpu/drm/i915/intel_pm.c
>> +++ b/drivers/gpu/drm/i915/intel_pm.c
>> @@ -4971,7 +4971,11 @@ static int valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
>>  		if (err)
>>  			return err;
>>  
>> -		gen6_set_rps_thresholds(dev_priv, val);
>> +		/* When byt can survive without system hang with dynamic
>> +		 * sw freq adjustments, this restriction can be lifted.
>> +		 */
>> +		if (!IS_VALLEYVIEW(dev_priv))
>
> Are all vlv affected?

Not all. From what I have gathered, the 4 core variants are
the susceptible ones. For example N28xx works, N29xx freezes.

-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

* [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
From: Mika Kuoppala @ 2017-02-15 13:52 UTC
  To: intel-gfx
  Cc: Mika Kuoppala, Chris Wilson, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Certain Baytrails, namely the 4 cpu core variants, have been
plagued by spurious system hangs, mostly occurring with light loads.

Multiple bisects by various people point to a commit which changes the
reclocking strategy for Baytrail to follow its bigger brethren:
commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")

There is also a review comment attached to this commit from Deepak S
on avoiding punit access on Cherryview, and thus it was excluded from
the common reclocking path. By taking the same approach and omitting
the punit access by not tweaking the thresholds when the hardware
has been asked to move to a different frequency, considerable gains
in stability have been observed.

With a J1900 box, a light render/video load would usually end up in a
system hang in less than 12 hours. With this patch applied, the cumulative
uptime has now been 34 days without issues. To provoke the system hang,
light loads on both render and bsd engines in parallel have been used:
glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4

So far, the author has not witnessed a system hang with the above load
and this patch applied. Reports from the tenacious people at
kernel bugzilla are also promising.

Considering that the punit access frequency with this patch is
considerably lower, there is a possibility that this will push
the, still unknown, root cause past the triggering point on most loads.

But as we now can reliably reproduce the hang independently,
we can reduce the pain that users are having and use
static thresholds until a root cause is found.

v3: don't break debugfs and simplification (Chris Wilson)

References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: fritsch@xbmc.org
Cc: miku@iki.fi
Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
CC: Michal Feix <michal@feix.cz>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Deepak S <deepak.s@linux.intel.com>
Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Cc: <stable@vger.kernel.org> # v4.2+
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_pm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 3d311e1..732256c 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4870,6 +4870,12 @@ static void gen6_set_rps_thresholds(struct drm_i915_private *dev_priv, u8 val)
 		break;
 	}
 
+	/* When byt can survive without system hang with dynamic
+	 * sw freq adjustments, this restriction can be lifted.
+	 */
+	if (IS_VALLEYVIEW(dev_priv))
+		goto skip_hw_write;
+
 	I915_WRITE(GEN6_RP_UP_EI,
 		   GT_INTERVAL_FROM_US(dev_priv, ei_up));
 	I915_WRITE(GEN6_RP_UP_THRESHOLD,
@@ -4890,6 +4896,7 @@ static void gen6_set_rps_thresholds(struct drm_i915_private *dev_priv, u8 val)
 		   GEN6_RP_UP_BUSY_AVG |
 		   GEN6_RP_DOWN_IDLE_AVG);
 
+skip_hw_write:
 	dev_priv->rps.power = new_power;
 	dev_priv->rps.up_threshold = threshold_up;
 	dev_priv->rps.down_threshold = threshold_down;
-- 
2.7.4
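
The v3 hunks above skip only the register writes on Valleyview and still fall
through to the assignments that cache the thresholds in dev_priv->rps. A
minimal userspace sketch of that control flow follows. It is an illustration
under stated assumptions, not the driver code: struct rps_state,
fake_hw_write() and set_rps_thresholds() are hypothetical names, the
is_valleyview flag stands in for IS_VALLEYVIEW(dev_priv), and a counter
stands in for the I915_WRITE() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's RPS bookkeeping. */
struct rps_state {
	bool is_valleyview;	/* stands in for IS_VALLEYVIEW(dev_priv) */
	uint8_t up_threshold;	/* cached value a debug interface could report */
	uint8_t down_threshold;
	unsigned int hw_writes;	/* counts simulated register writes */
};

/* Simulated register write; the driver calls I915_WRITE() here. */
static void fake_hw_write(struct rps_state *rps)
{
	rps->hw_writes++;
}

static void set_rps_thresholds(struct rps_state *rps,
			       uint8_t up_pct, uint8_t down_pct)
{
	assert(down_pct <= up_pct && up_pct <= 100);

	/* Same shape as the v3 hunk: skip only the hardware programming. */
	if (rps->is_valleyview)
		goto skip_hw_write;

	fake_hw_write(rps);	/* up threshold */
	fake_hw_write(rps);	/* down threshold */

skip_hw_write:
	/* The cached values are updated on every platform. */
	rps->up_threshold = up_pct;
	rps->down_threshold = down_pct;
}
```

Because the cached fields are written either way, code that reads them keeps
seeing current values on Valleyview even though no register write took place.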

* ✓ Fi.CI.BAT: success for drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2 (rev2)
From: Patchwork @ 2017-02-15 15:28 UTC
  To: Mika Kuoppala; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2 (rev2)
URL   : https://patchwork.freedesktop.org/series/19702/
State : success

== Summary ==

Series 19702v2 drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2
https://patchwork.freedesktop.org/api/1.0/series/19702/revisions/2/mbox/

Test kms_pipe_crc_basic:
        Subgroup nonblocking-crc-pipe-a:
                dmesg-fail -> PASS       (fi-snb-2520m)
        Subgroup nonblocking-crc-pipe-a-frame-sequence:
                dmesg-fail -> PASS       (fi-snb-2520m)
        Subgroup nonblocking-crc-pipe-b:
                dmesg-fail -> PASS       (fi-snb-2520m)
        Subgroup nonblocking-crc-pipe-b-frame-sequence:
                dmesg-fail -> PASS       (fi-snb-2520m)
        Subgroup read-crc-pipe-a:
                incomplete -> PASS       (fi-snb-2520m)

fi-bdw-5557u     total:252  pass:238  dwarn:3   dfail:0   fail:0   skip:11 
fi-bsw-n3050     total:252  pass:210  dwarn:3   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:252  pass:230  dwarn:3   dfail:0   fail:0   skip:19 
fi-bxt-t5700     total:83   pass:70   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:252  pass:222  dwarn:3   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:252  pass:218  dwarn:3   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:252  pass:233  dwarn:3   dfail:0   fail:0   skip:16 
fi-hsw-4770r     total:252  pass:233  dwarn:3   dfail:0   fail:0   skip:16 
fi-ilk-650       total:252  pass:199  dwarn:3   dfail:0   fail:0   skip:50 
fi-ivb-3520m     total:252  pass:231  dwarn:3   dfail:0   fail:0   skip:18 
fi-ivb-3770      total:252  pass:231  dwarn:3   dfail:0   fail:0   skip:18 
fi-kbl-7500u     total:252  pass:231  dwarn:3   dfail:0   fail:0   skip:18 
fi-skl-6260u     total:252  pass:239  dwarn:3   dfail:0   fail:0   skip:10 
fi-skl-6700hq    total:252  pass:232  dwarn:3   dfail:0   fail:0   skip:17 
fi-skl-6700k     total:252  pass:230  dwarn:4   dfail:0   fail:0   skip:18 
fi-skl-6770hq    total:252  pass:239  dwarn:3   dfail:0   fail:0   skip:10 
fi-snb-2520m     total:252  pass:221  dwarn:3   dfail:0   fail:0   skip:28 
fi-snb-2600      total:252  pass:220  dwarn:3   dfail:0   fail:0   skip:29 

cc11223a7f11b4e2d15f1c645326ac6f34568d88 drm-tip: 2017y-02m-15d-13h-44m-31s UTC integration manifest
c4ada14 drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3825/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
  2017-02-15 13:52   ` Mika Kuoppala
@ 2017-02-27  9:25     ` Chris Wilson
  -1 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-02-27  9:25 UTC (permalink / raw)
  To: Mika Kuoppala
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

On Wed, Feb 15, 2017 at 03:52:59PM +0200, Mika Kuoppala wrote:
> Certain Baytrails, namely the 4 cpu core variants, have been
> plagued by spurious system hangs, mostly occurring with light loads.
> 
> Multiple bisects by various people point to a commit which changes the
> reclocking strategy for Baytrail to follow its bigger brethren:
> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> 
> There is also a review comment attached to this commit from Deepak S
> on avoiding punit access on Cherryview and thus it was excluded on
> common reclocking path. By taking the same approach and omitting
> the punit access by not tweaking the thresholds when the hardware
> has been asked to move into different frequency, considerable gains
> in stability have been observed.
> 
> With J1900 box, light render/video load would end up in system hang
> in usually less than 12 hours. With this patch applied, the cumulative
> uptime has now been 34 days without issues. To provoke system hang,
> light loads on both render and bsd engines in parallel have been used:
> glxgears >/dev/null 2>/dev/null &
> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> 
> So far, author has not witnessed system hang with above load
> and this patch applied. Reports from the tenacious people at
> kernel bugzilla are also promising.
> 
> Considering that the punit access frequency with this patch is
> considerably less, there is a possibility that this will push
> the, still unknown, root cause past the triggering point on most loads.
> 
> But as we now can reliably reproduce the hang independently,
> we can reduce the pain that users are having and use
> static thresholds until a root cause is found.
> 
> v3: don't break debugfs and simplification (Chris Wilson)
> 
> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: fritsch@xbmc.org
> Cc: miku@iki.fi
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> CC: Michal Feix <michal@feix.cz>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v4.2+
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

Had a couple of weekends to try and find an alternative explanation
(a root cause for the hangs would be nice!). If it is just the writes to
the RPS registers, are we safe on resume (etc)?

However, I've drawn a blank on explaining what the hw is doing wrong
(but found a couple of bugs in the byt manual RPS evaluation which
desire review), so
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
  2017-02-27  9:25     ` Chris Wilson
@ 2017-02-27 13:22       ` Mika Kuoppala
  -1 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-02-27 13:22 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Wed, Feb 15, 2017 at 03:52:59PM +0200, Mika Kuoppala wrote:
>> Certain Baytrails, namely the 4 cpu core variants, have been
>> plagued by spurious system hangs, mostly occurring with light loads.
>> 
>> Multiple bisects by various people point to a commit which changes the
>> reclocking strategy for Baytrail to follow its bigger brethren:
>> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
>> 
>> There is also a review comment attached to this commit from Deepak S
>> on avoiding punit access on Cherryview and thus it was excluded on
>> common reclocking path. By taking the same approach and omitting
>> the punit access by not tweaking the thresholds when the hardware
>> has been asked to move into different frequency, considerable gains
>> in stability have been observed.
>> 
>> With J1900 box, light render/video load would end up in system hang
>> in usually less than 12 hours. With this patch applied, the cumulative
>> uptime has now been 34 days without issues. To provoke system hang,
>> light loads on both render and bsd engines in parallel have been used:
>> glxgears >/dev/null 2>/dev/null &
>> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
>> 
>> So far, author has not witnessed system hang with above load
>> and this patch applied. Reports from the tenacious people at
>> kernel bugzilla are also promising.
>> 
>> Considering that the punit access frequency with this patch is
>> considerably less, there is a possibility that this will push
>> the, still unknown, root cause past the triggering point on most loads.
>> 
>> But as we now can reliably reproduce the hang independently,
>> we can reduce the pain that users are having and use
>> static thresholds until a root cause is found.
>> 
>> v3: don't break debugfs and simplification (Chris Wilson)
>> 
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Jani Nikula <jani.nikula@intel.com>
>> Cc: fritsch@xbmc.org
>> Cc: miku@iki.fi
>> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
>> CC: Michal Feix <michal@feix.cz>
>> Cc: Hans de Goede <hdegoede@redhat.com>
>> Cc: Deepak S <deepak.s@linux.intel.com>
>> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
>> Cc: <stable@vger.kernel.org> # v4.2+
>> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>
> Had a couple of weekends to try and find an alternative explanation
> (a root cause for the hangs would be nice!). If it is just the writes to
> the RPS registers, are we safe on resume (etc)?
>
> However, I've drawn a blank on explaining what the hw is doing wrong
> (but found a couple of bugs in the byt manual RPS evaluation which
> desire review), so
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>

Pushed, thanks.
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre


end of thread, other threads:[~2017-02-27 13:24 UTC | newest]

Thread overview: 11+ messages
2017-02-15 12:37 [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2 Mika Kuoppala
2017-02-15 12:52 ` Chris Wilson
2017-02-15 12:52   ` Chris Wilson
2017-02-15 13:51   ` Mika Kuoppala
2017-02-15 13:52 ` [PATCH] drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3 Mika Kuoppala
2017-02-15 13:52   ` Mika Kuoppala
2017-02-27  9:25   ` Chris Wilson
2017-02-27  9:25     ` Chris Wilson
2017-02-27 13:22     ` Mika Kuoppala
2017-02-27 13:22       ` Mika Kuoppala
2017-02-15 15:28 ` ✓ Fi.CI.BAT: success for drm/i915: Avoid tweaking evaluation thresholds on Baytrail v2 (rev2) Patchwork
