* [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
@ 2017-01-25 12:31 ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-01-25 12:31 UTC (permalink / raw)
  To: intel-gfx
  Cc: Mika Kuoppala, Chris Wilson, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Certain Baytrails, namely the 4 cpu core variants, have been
plagued by spurious system hangs, mostly occurring with light loads.

Multiple bisects by various people point to a commit which changes the
reclocking strategy for Baytrail to follow its bigger brethren:
commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")

There is also a review comment from Deepak S attached to this commit
on avoiding punit access on Cherryview, which is therefore excluded from
the common reclocking path. Taking the same approach for Baytrail, that
is, omitting the punit access by not tweaking the thresholds when the
hardware has been asked to move to a different frequency, has brought
considerable gains in stability.

With a J1900 box, a light render/video load would usually end in a system
hang in less than 12 hours. With this patch applied, the cumulative
uptime has now been 34 days without issues. To provoke the system hang,
light loads on both the render and bsd engines in parallel have been used:
glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4

So far, the author has not witnessed a system hang with the above load
and this patch applied. Reports from the tenacious people at
kernel bugzilla are also promising.

Considering that the punit is accessed considerably less often with this
patch, there is a possibility that this merely pushes the still unknown
root cause past the triggering point on most loads.
Further work on investigating the punit accesses on byt is welcome.
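
For readers following along, here is a minimal standalone sketch of what
evaluating an interval against fixed thresholds could look like. It is
not the driver code: the struct, the function names and the event bits
are hypothetical, and both directions share one interval for brevity.
Only the 90 and 70 percent values come from the defines this patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Percentages taken from the defines added by the patch. */
#define UP_THRESHOLD_PCT   90
#define DOWN_THRESHOLD_PCT 70

/* Hypothetical event bits, for illustration only. */
#define EVENT_UP   (1u << 0)
#define EVENT_DOWN (1u << 1)

/* Hypothetical snapshot of two free-running counters. */
struct ei_sample {
	uint64_t clock;	/* reference clock ticks */
	uint64_t busy;	/* ticks the GPU spent busy, same unit */
};

/* True when the busy share of the interval is at least threshold_pct. */
static bool busy_above(const struct ei_sample *old,
		       const struct ei_sample *now,
		       unsigned int threshold_pct)
{
	uint64_t elapsed = now->clock - old->clock;
	uint64_t busy = now->busy - old->busy;

	if (elapsed == 0)
		return false;

	/* Compare busy/elapsed with threshold_pct/100 without dividing. */
	return busy * 100 >= elapsed * threshold_pct;
}

/* Decide which reclocking events one evaluation interval raises. */
static unsigned int evaluate_interval(const struct ei_sample *old,
				      const struct ei_sample *now)
{
	unsigned int events = 0;

	if (busy_above(old, now, UP_THRESHOLD_PCT))
		events |= EVENT_UP;
	if (!busy_above(old, now, DOWN_THRESHOLD_PCT))
		events |= EVENT_DOWN;

	return events;
}
```

With constant thresholds nothing has to be reprogrammed when the
requested frequency changes, which is the property the patch relies on.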

References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: fritsch@xbmc.org
Cc: miku@iki.fi
Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
CC: Michal Feix <michal@feix.cz>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Deepak S <deepak.s@linux.intel.com>
Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 4 ++--
 drivers/gpu/drm/i915/i915_reg.h | 2 ++
 drivers/gpu/drm/i915/intel_pm.c | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 3fc286c..4b9635f 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1039,7 +1039,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
 	if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
 		if (!vlv_c0_above(dev_priv,
 				  &dev_priv->rps.down_ei, &now,
-				  dev_priv->rps.down_threshold))
+				  VLV_RP_DOWN_EI_THRESHOLD))
 			events |= GEN6_PM_RP_DOWN_THRESHOLD;
 		dev_priv->rps.down_ei = now;
 	}
@@ -1047,7 +1047,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
 	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
 		if (vlv_c0_above(dev_priv,
 				 &dev_priv->rps.up_ei, &now,
-				 dev_priv->rps.up_threshold))
+				 VLV_RP_UP_EI_THRESHOLD))
 			events |= GEN6_PM_RP_UP_THRESHOLD;
 		dev_priv->rps.up_ei = now;
 	}
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 70d9616..09f6aea 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -787,6 +787,8 @@ enum skl_disp_power_wells {
 #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
 
 #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
+#define VLV_RP_UP_EI_THRESHOLD			90
+#define VLV_RP_DOWN_EI_THRESHOLD		70
 
 /* vlv2 north clock has */
 #define CCK_FUSE_REG				0x8
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index db24f89..1923b6b 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4983,7 +4983,7 @@ static void valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
 
 	if (val != dev_priv->rps.cur_freq) {
 		vlv_punit_write(dev_priv, PUNIT_REG_GPU_FREQ_REQ, val);
-		if (!IS_CHERRYVIEW(dev_priv))
+		if (!(IS_CHERRYVIEW(dev_priv) || IS_VALLEYVIEW(dev_priv)))
 			gen6_set_rps_thresholds(dev_priv, val);
 	}
 
-- 
2.7.4


* Re: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 12:31 ` Mika Kuoppala
  (?)
@ 2017-01-25 12:42 ` Chris Wilson
  2017-01-25 13:09     ` Mika Kuoppala
  -1 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-01-25 12:42 UTC (permalink / raw)
  To: Mika Kuoppala
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

On Wed, Jan 25, 2017 at 02:31:08PM +0200, Mika Kuoppala wrote:
> Certain Baytrails, namely the 4 cpu core variants, have been
> plagued by spurious system hangs, mostly occurring with light loads.
> 
> Multiple bisects by various people point to a commit which changes the
> reclocking strategy for Baytrail to follow its bigger brethren:
> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> 
> There is also a review comment attached to this commit from Deepak S
> on avoiding punit access on Cherryview and thus it is excluded on
> common reclocking path. By taking the same approach and omitting
> the punit access by not tweaking the thresholds when the hardware
> has been asked to move into different frequency, considerable gains
> in stability have been observed.
> 
> With J1900 box, light render/video load would end up in system hang
> in usually less than 12 hours. With this patch applied, the cumulative
> uptime has now been 34 days without issues. To provoke system hang,
> light loads on both render and bsd engines in parallel have been used:
> glxgears >/dev/null 2>/dev/null &
> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> 
> So far, author has not witnessed system hang with above load
> and this patch applied. Reports from the tenacious people at
> kernel bugzilla are also promising.
> 
> Considering that the punit access frequency with this patch is
> considerably less, there is a possibility that this will push
> the, still unknown, root cause past the triggering point on most loads.
> Further work on investigating the punit accesses on byt is welcomed.

Please find the underlying problem instead of disabling rps for all vlv
for a GT-specific problem.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 12:42 ` Chris Wilson
@ 2017-01-25 13:09     ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-01-25 13:09 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Wed, Jan 25, 2017 at 02:31:08PM +0200, Mika Kuoppala wrote:
>> Certain Baytrails, namely the 4 cpu core variants, have been
>> plagued by spurious system hangs, mostly occurring with light loads.
>> 
>> Multiple bisects by various people point to a commit which changes the
>> reclocking strategy for Baytrail to follow its bigger brethren:
>> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
>> 
>> There is also a review comment attached to this commit from Deepak S
>> on avoiding punit access on Cherryview and thus it is excluded on
>> common reclocking path. By taking the same approach and omitting
>> the punit access by not tweaking the thresholds when the hardware
>> has been asked to move into different frequency, considerable gains
>> in stability have been observed.
>> 
>> With J1900 box, light render/video load would end up in system hang
>> in usually less than 12 hours. With this patch applied, the cumulative
>> uptime has now been 34 days without issues. To provoke system hang,
>> light loads on both render and bsd engines in parallel have been used:
>> glxgears >/dev/null 2>/dev/null &
>> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
>> 
>> So far, author has not witnessed system hang with above load
>> and this patch applied. Reports from the tenacious people at
>> kernel bugzilla are also promising.
>> 
>> Considering that the punit access frequency with this patch is
>> considerably less, there is a possibility that this will push
>> the, still unknown, root cause past the triggering point on most loads.
>> Further work on investigating the punit accesses on byt is welcomed.
>
> Please find the underlying problem instead of disabling rps for all vlv
> for a GT-specific problem.

This is not disabling rps.
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 13:09     ` Mika Kuoppala
  (?)
@ 2017-01-25 13:17     ` Chris Wilson
  2017-01-25 13:35       ` Mika Kuoppala
  -1 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-01-25 13:17 UTC (permalink / raw)
  To: Mika Kuoppala
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

On Wed, Jan 25, 2017 at 03:09:04PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Wed, Jan 25, 2017 at 02:31:08PM +0200, Mika Kuoppala wrote:
> >> Certain Baytrails, namely the 4 cpu core variants, have been
> >> plagued by spurious system hangs, mostly occurring with light loads.
> >> 
> >> Multiple bisects by various people point to a commit which changes the
> >> reclocking strategy for Baytrail to follow its bigger brethren:
> >> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> >> 
> >> There is also a review comment attached to this commit from Deepak S
> >> on avoiding punit access on Cherryview and thus it is excluded on
> >> common reclocking path. By taking the same approach and omitting
> >> the punit access by not tweaking the thresholds when the hardware
> >> has been asked to move into different frequency, considerable gains
> >> in stability have been observed.
> >> 
> >> With J1900 box, light render/video load would end up in system hang
> >> in usually less than 12 hours. With this patch applied, the cumulative
> >> uptime has now been 34 days without issues. To provoke system hang,
> >> light loads on both render and bsd engines in parallel have been used:
> >> glxgears >/dev/null 2>/dev/null &
> >> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> >> 
> >> So far, author has not witnessed system hang with above load
> >> and this patch applied. Reports from the tenacious people at
> >> kernel bugzilla are also promising.
> >> 
> >> Considering that the punit access frequency with this patch is
> >> considerably less, there is a possibility that this will push
> >> the, still unknown, root cause past the triggering point on most loads.
> >> Further work on investigating the punit accesses on byt is welcomed.
> >
> > Please find the underlying problem instead of disabling rps for all vlv
> > for a GT-specific problem.
> 
> This is not disabling rps.

You are disabling the key ingredients of the algorithm, making it less
generic in order to work around a problem elsewhere. You are tackling the
symptoms and not the cause.
-Chris
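
To make the trade-off under discussion concrete, here is a small sketch
contrasting thresholds that follow the current frequency with the fixed
pair from the patch. Everything except the fixed 90/70 pair is an
assumption made for illustration: the zone boundaries, the per-zone
percentages and all names are hypothetical and are not taken from the
driver.

```c
#include <stdbool.h>

struct thresholds {
	unsigned int up_pct;	/* upclock at or above this busy share */
	unsigned int down_pct;	/* downclock below this busy share */
};

/*
 * Pick the evaluation thresholds for the current frequency.
 * low_mark and high_mark are hypothetical zone boundaries.
 */
static struct thresholds pick_thresholds(unsigned int freq,
					 unsigned int low_mark,
					 unsigned int high_mark,
					 bool adaptive)
{
	/* Fixed pair, as in the defines added by the patch. */
	struct thresholds t = { .up_pct = 90, .down_pct = 70 };

	if (!adaptive)
		return t;

	if (freq < low_mark) {
		/* Hypothetical: be reluctant to leave the low zone. */
		t.up_pct = 95;
		t.down_pct = 85;
	} else if (freq > high_mark) {
		/* Hypothetical: be reluctant to leave the high zone. */
		t.up_pct = 85;
		t.down_pct = 60;
	}

	return t;
}
```

In the adaptive case the result changes whenever the frequency crosses a
zone boundary, so the thresholds have to be updated on reclocking. The
fixed case gives up that adaptation in exchange for doing no such update.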

-- 
Chris Wilson, Intel Open Source Technology Centre

* ✓ Fi.CI.BAT: success for drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 12:31 ` Mika Kuoppala
  (?)
  (?)
@ 2017-01-25 13:24 ` Patchwork
  -1 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-01-25 13:24 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/byt: Avoid tweaking evaluation thresholds
URL   : https://patchwork.freedesktop.org/series/18543/
State : success

== Summary ==

Series 18543v1 drm/i915/byt: Avoid tweaking evaluation thresholds
https://patchwork.freedesktop.org/api/1.0/series/18543/revisions/1/mbox/


fi-bdw-5557u     total:247  pass:233  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:247  pass:208  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:247  pass:225  dwarn:0   dfail:0   fail:0   skip:22 
fi-bxt-t5700     total:79   pass:66   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:247  pass:220  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-ivb-3520m     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6260u     total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:247  pass:227  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:247  pass:222  dwarn:4   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:247  pass:215  dwarn:0   dfail:0   fail:0   skip:32 

396d17a6de32b4ef6cf1b531248e25ca6efe8001 drm-tip: 2017y-01m-25d-11h-07m-11s UTC integration manifest
75cf950 drm/i915/byt: Avoid tweaking evaluation thresholds

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3603/

* Re: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 13:17     ` Chris Wilson
@ 2017-01-25 13:35       ` Mika Kuoppala
  0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-01-25 13:35 UTC (permalink / raw)
  To: Chris Wilson
  Cc: intel-gfx, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Wed, Jan 25, 2017 at 03:09:04PM +0200, Mika Kuoppala wrote:
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > On Wed, Jan 25, 2017 at 02:31:08PM +0200, Mika Kuoppala wrote:
>> >> Certain Baytrails, namely the 4 cpu core variants, have been
>> >> plagued by spurious system hangs, mostly occurring with light loads.
>> >> 
>> >> Multiple bisects by various people point to a commit which changes the
>> >> reclocking strategy for Baytrail to follow its bigger brethren:
>> >> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
>> >> 
>> >> There is also a review comment attached to this commit from Deepak S
>> >> on avoiding punit access on Cherryview and thus it is excluded on
>> >> common reclocking path. By taking the same approach and omitting
>> >> the punit access by not tweaking the thresholds when the hardware
>> >> has been asked to move into different frequency, considerable gains
>> >> in stability have been observed.
>> >> 
>> >> With J1900 box, light render/video load would end up in system hang
>> >> in usually less than 12 hours. With this patch applied, the cumulative
>> >> uptime has now been 34 days without issues. To provoke system hang,
>> >> light loads on both render and bsd engines in parallel have been used:
>> >> glxgears >/dev/null 2>/dev/null &
>> >> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
>> >> 
>> >> So far, author has not witnessed system hang with above load
>> >> and this patch applied. Reports from the tenacious people at
>> >> kernel bugzilla are also promising.
>> >> 
>> >> Considering that the punit access frequency with this patch is
>> >> considerably less, there is a possibility that this will push
>> >> the, still unknown, root cause past the triggering point on most loads.
>> >> Further work on investigating the punit accesses on byt is welcomed.
>> >
>> > Please find the underlying problem instead of disabling rps for all vlv
>> > for a GT-specific problem.
>> 
>> This is not disabling rps.
>
> You are disabling the key ingredients of the algorithm, making it less
> generic in order to work around a problem elsewhere. You are tackling the
> symptoms and not the cause.

Yes, definitely we are tackling the symptoms.

We have been trying to find the root cause for 2 years,
admittedly hindered by the multiple other causes of
system hangs on the baytrail platform.

One could ask why the deviation for Cherryview was accepted,
as this patch just mimics it by omitting the sw adjustments.

It allows baytrail users to run their rigs without
intel_idle.max_cstate=1, which ruins their power budget by a far
bigger margin than this patch does.

-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-25 12:31 ` Mika Kuoppala
@ 2017-01-26  8:47   ` Daniel Vetter
  -1 siblings, 0 replies; 11+ messages in thread
From: Daniel Vetter @ 2017-01-26  8:47 UTC (permalink / raw)
  To: Mika Kuoppala
  Cc: intel-gfx, Chris Wilson, Ville Syrjälä,
	Len Brown, Daniel Vetter, Jani Nikula, fritsch, miku,
	Ezequiel Garcia, Michal Feix, Hans de Goede, Deepak S,
	Jarkko Nikula, # v4 . 2+

On Wed, Jan 25, 2017 at 02:31:08PM +0200, Mika Kuoppala wrote:
> Certain Baytrails, namely the 4 cpu core variants, have been
> plagued by spurious system hangs, mostly occurring with light loads.
> 
> Multiple bisects by various people point to a commit which changes the
> reclocking strategy for Baytrail to follow its bigger brethren:
> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> 
> There is also a review comment attached to this commit from Deepak S
> on avoiding punit access on Cherryview and thus it is excluded on
> common reclocking path. By taking the same approach and omitting
> the punit access by not tweaking the thresholds when the hardware
> has been asked to move into different frequency, considerable gains
> in stability have been observed.
> 
> With J1900 box, light render/video load would end up in system hang
> in usually less than 12 hours. With this patch applied, the cumulative
> uptime has now been 34 days without issues. To provoke system hang,
> light loads on both render and bsd engines in parallel have been used:
> glxgears >/dev/null 2>/dev/null &
> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> 
> So far, author has not witnessed system hang with above load
> and this patch applied. Reports from the tenacious people at
> kernel bugzilla are also promising.
> 
> Considering that the punit access frequency with this patch is
> considerably less, there is a possibility that this will push
> the, still unknown, root cause past the triggering point on most loads.
> Further work on investigating the punit accesses on byt is welcomed.
> 
> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: fritsch@xbmc.org
> Cc: miku@iki.fi
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> CC: Michal Feix <michal@feix.cz>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v4.2+
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

It sucks, but I guess this is better than dead machines. I'd say let's
wait another 1-2 weeks for tested-bys to trickle in, and if it does fix
the problem then let's apply it. rps keeps on sucking, that's
unfortunately not news at all.

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/i915/i915_irq.c | 4 ++--
>  drivers/gpu/drm/i915/i915_reg.h | 2 ++
>  drivers/gpu/drm/i915/intel_pm.c | 2 +-
>  3 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 3fc286c..4b9635f 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1039,7 +1039,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>  	if (pm_iir & GEN6_PM_RP_DOWN_EI_EXPIRED) {
>  		if (!vlv_c0_above(dev_priv,
>  				  &dev_priv->rps.down_ei, &now,
> -				  dev_priv->rps.down_threshold))
> +				  VLV_RP_DOWN_EI_THRESHOLD))
>  			events |= GEN6_PM_RP_DOWN_THRESHOLD;
>  		dev_priv->rps.down_ei = now;
>  	}
> @@ -1047,7 +1047,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
>  	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
>  		if (vlv_c0_above(dev_priv,
>  				 &dev_priv->rps.up_ei, &now,
> -				 dev_priv->rps.up_threshold))
> +				 VLV_RP_UP_EI_THRESHOLD))
>  			events |= GEN6_PM_RP_UP_THRESHOLD;
>  		dev_priv->rps.up_ei = now;
>  	}
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 70d9616..09f6aea 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -787,6 +787,8 @@ enum skl_disp_power_wells {
>  #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
>  
>  #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
> +#define VLV_RP_UP_EI_THRESHOLD			90
> +#define VLV_RP_DOWN_EI_THRESHOLD		70
>  
>  /* vlv2 north clock has */
>  #define CCK_FUSE_REG				0x8
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index db24f89..1923b6b 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -4983,7 +4983,7 @@ static void valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
>  
>  	if (val != dev_priv->rps.cur_freq) {
>  		vlv_punit_write(dev_priv, PUNIT_REG_GPU_FREQ_REQ, val);
> -		if (!IS_CHERRYVIEW(dev_priv))
> +		if (!(IS_CHERRYVIEW(dev_priv) || IS_VALLEYVIEW(dev_priv)))
>  			gen6_set_rps_thresholds(dev_priv, val);
>  	}
>  
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

>  	if (pm_iir & GEN6_PM_RP_UP_EI_EXPIRED) {
>  		if (vlv_c0_above(dev_priv,
>  				 &dev_priv->rps.up_ei, &now,
> -				 dev_priv->rps.up_threshold))
> +				 VLV_RP_UP_EI_THRESHOLD))
>  			events |= GEN6_PM_RP_UP_THRESHOLD;
>  		dev_priv->rps.up_ei = now;
>  	}
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 70d9616..09f6aea 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -787,6 +787,8 @@ enum skl_disp_power_wells {
>  #define 	CHV_BIAS_CPU_50_SOC_50 (3 << 2)
>  
>  #define VLV_CZ_CLOCK_TO_MILLI_SEC		100000
> +#define VLV_RP_UP_EI_THRESHOLD			90
> +#define VLV_RP_DOWN_EI_THRESHOLD		70
>  
>  /* vlv2 north clock has */
>  #define CCK_FUSE_REG				0x8
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index db24f89..1923b6b 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -4983,7 +4983,7 @@ static void valleyview_set_rps(struct drm_i915_private *dev_priv, u8 val)
>  
>  	if (val != dev_priv->rps.cur_freq) {
>  		vlv_punit_write(dev_priv, PUNIT_REG_GPU_FREQ_REQ, val);
> -		if (!IS_CHERRYVIEW(dev_priv))
> +		if (!(IS_CHERRYVIEW(dev_priv) || IS_VALLEYVIEW(dev_priv)))
>  			gen6_set_rps_thresholds(dev_priv, val);
>  	}
>  
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
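
For readers following the i915_irq.c hunk quoted above, here is a minimal
sketch of what comparing an evaluation interval against fixed percentage
thresholds looks like. Only the two threshold values (90 and 70) come from
the patch. The struct, the enum, and both function names are hypothetical,
and the sketch folds the up and down checks into one call, whereas the
driver handles them on separate EI-expired events with separate baselines.
It is not the driver's vlv_c0_above().

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the i915_reg.h hunk in the patch. */
#define VLV_RP_UP_EI_THRESHOLD   90
#define VLV_RP_DOWN_EI_THRESHOLD 70

/* Hypothetical snapshot of free-running counters (not a driver type). */
struct ei_sample {
	uint64_t elapsed_ticks; /* reference clock ticks */
	uint64_t busy_ticks;    /* ticks during which the GPU was busy */
};

/* Hypothetical result type for this sketch. */
enum rps_event { RPS_EVENT_NONE, RPS_EVENT_UP, RPS_EVENT_DOWN };

/* True when busy/elapsed >= threshold_pct/100, computed without division. */
static bool busy_at_or_above(const struct ei_sample *old,
			     const struct ei_sample *now,
			     unsigned int threshold_pct)
{
	uint64_t elapsed = now->elapsed_ticks - old->elapsed_ticks;
	uint64_t busy = now->busy_ticks - old->busy_ticks;

	return busy * 100 >= elapsed * threshold_pct;
}

/* Classify one interval using the fixed thresholds. */
static enum rps_event evaluate_interval(const struct ei_sample *old,
					const struct ei_sample *now)
{
	/* No time elapsed: nothing to evaluate. */
	if (now->elapsed_ticks == old->elapsed_ticks)
		return RPS_EVENT_NONE;
	if (busy_at_or_above(old, now, VLV_RP_UP_EI_THRESHOLD))
		return RPS_EVENT_UP;
	if (!busy_at_or_above(old, now, VLV_RP_DOWN_EI_THRESHOLD))
		return RPS_EVENT_DOWN;
	return RPS_EVENT_NONE;
}
```

Because the thresholds are compile-time constants, the comparison no longer
depends on values that change with each frequency request, which matches the
intent of dropping the gen6_set_rps_thresholds() call for Baytrail.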


* RE: [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds
  2017-01-26  8:47   ` Daniel Vetter
  (?)
@ 2017-01-26 17:29   ` Brown, Len
  -1 siblings, 0 replies; 11+ messages in thread
From: Brown, Len @ 2017-01-26 17:29 UTC (permalink / raw)
  To: Daniel Vetter, Mika Kuoppala
  Cc: intel-gfx, Chris Wilson, Ville Syrjälä,
	Daniel Vetter, Nikula, Jani, fritsch, miku, Ezequiel Garcia,
	Michal Feix, Hans de Goede, Deepak S, Jarkko Nikula, # v4 . 2+

> It sucks, but I guess this is better than dead machines. I'd say let's
> wait another 1-2 weeks for tested-bys to trickle in, and if it does fix
> the problem then let's apply it. rps keeps on sucking, that's
> unfortunately not news at all.
> 
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I have 3 machines that I can hang in under 10 minutes.
They are different types of baytrail: n3540, j1900, z3775.

With this revert applied, all three machines have run
the same stress test for over 7 days without failure.

I disagree with an action plan that includes the word "wait".

Tested-by: Len Brown <len.brown@intel.com>

Indeed, my question is if we can turn off GFX p-states entirely
on this hardware.  Is there a command line parameter I can
use to do that?  If we have one, it will certainly make
troubleshooting orders of magnitude easier.

Note that the bisected patch

     commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
     Author: Chris Wilson <chris@chris-wilson.co.uk>
     Date:   Tue Apr 7 16:20:28 2015 +0100
    
     drm/i915: Agressive downclocking on Baytrail

was applied to Linux 3.17-rc1.
Thus, this revert should be applied to every stable release back to 3.17.

thanks,
-Len



Thread overview: 11+ messages
2017-01-25 12:31 [PATCH] drm/i915/byt: Avoid tweaking evaluation thresholds Mika Kuoppala
2017-01-25 12:31 ` Mika Kuoppala
2017-01-25 12:42 ` Chris Wilson
2017-01-25 13:09   ` Mika Kuoppala
2017-01-25 13:09     ` Mika Kuoppala
2017-01-25 13:17     ` Chris Wilson
2017-01-25 13:35       ` Mika Kuoppala
2017-01-25 13:24 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-01-26  8:47 ` [PATCH] " Daniel Vetter
2017-01-26  8:47   ` Daniel Vetter
2017-01-26 17:29   ` Brown, Len
