linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
@ 2020-11-01 17:41 rwright
  2020-11-01 17:41 ` [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED rwright
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: rwright @ 2020-11-01 17:41 UTC (permalink / raw)
  To: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, hdegoede, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, rwright
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

From: Randy Wright <rwright@hpe.com>

For several months, I've been experiencing GPU hangs when  starting
Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
kernel.  I reported this recently in
https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
attached the requested evidence including the state collected from
/sys/class/drm/card0/error and debug output from dmesg.

I ran a bisect to find the problem, which indicates this is the
troublesome commit:

  [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts

The nature of that commit suggested to me that reducing the
batch size used in the context clear operation might help this
relatively low-powered system to avoid the hang.... and it did!
I simply forced this system to take the smaller batch length that is
already used for non-Haswell systems.

The first two versions of this patch were posted as RFC
patches to the Intel-gfx list, implementing the same
algorithmic change in function batch_get_defaults,
but without employing a properly constructed quirk.

I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
The quirk is presently set only for the aforementioned HP Pavilion Mini
300-020.  The patch now touches three files to define the quirk, set it,
and then check for it in function batch_get_defaults.

Randy Wright (3):
  drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED
  drm/i915/display: Add function quirk_renderclear_reduced
  drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED
    is set.

 drivers/gpu/drm/i915/display/intel_quirks.c | 13 +++++++++++++
 drivers/gpu/drm/i915/gt/gen7_renderclear.c  |  2 +-
 drivers/gpu/drm/i915/i915_drv.h             |  1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED
  2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
@ 2020-11-01 17:41 ` rwright
  2020-11-01 17:41 ` [PATCH v3 2/3] drm/i915/display: Add function quirk_renderclear_reduced rwright
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: rwright @ 2020-11-01 17:41 UTC (permalink / raw)
  To: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, hdegoede, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, rwright
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

From: Randy Wright <rwright@hpe.com>

Introduce quirk QUIRK_RENDERCLEAR_REDUCED, which will be used
to force a limited batch buffer size for clearing
residual contexts in gen7_renderclear.c.

Signed-off-by: Randy Wright <rwright@hpe.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e4f7f6518945..e8873462eb2c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -525,6 +525,7 @@ struct i915_psr {
 #define QUIRK_PIN_SWIZZLED_PAGES (1<<5)
 #define QUIRK_INCREASE_T12_DELAY (1<<6)
 #define QUIRK_INCREASE_DDI_DISABLED_TIME (1<<7)
+#define QUIRK_RENDERCLEAR_REDUCED (1<<8)
 
 struct intel_fbdev;
 struct intel_fbc_work;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 2/3] drm/i915/display: Add function quirk_renderclear_reduced
  2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
  2020-11-01 17:41 ` [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED rwright
@ 2020-11-01 17:41 ` rwright
  2020-11-01 17:41 ` [PATCH v3 3/3] drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED is set rwright
  2020-11-02  9:48 ` [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang Hans de Goede
  3 siblings, 0 replies; 8+ messages in thread
From: rwright @ 2020-11-01 17:41 UTC (permalink / raw)
  To: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, hdegoede, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, rwright
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

From: Randy Wright <rwright@hpe.com>

Added function quirk_renderclear_reduced to set QUIRK_RENDERCLEAR_REDUCED
for designated platforms.  Applying QUIRK_RENDERCLEAR_REDUCED for
the HP Pavilion Mini 300-020 prevents a GPU hang.

Signed-off-by: Randy Wright <rwright@hpe.com>
---
 drivers/gpu/drm/i915/display/intel_quirks.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/display/intel_quirks.c b/drivers/gpu/drm/i915/display/intel_quirks.c
index 46beb155d835..630b984ba49c 100644
--- a/drivers/gpu/drm/i915/display/intel_quirks.c
+++ b/drivers/gpu/drm/i915/display/intel_quirks.c
@@ -53,6 +53,16 @@ static void quirk_increase_ddi_disabled_time(struct drm_i915_private *i915)
 	drm_info(&i915->drm, "Applying Increase DDI Disabled quirk\n");
 }
 
+/*
+ * Force use of smaller batch size in gen7_renderclear.c
+ * Needed on (at least) HP Pavilion Mini 300-020 to avoid GPU hang.
+ */
+static void quirk_renderclear_reduced(struct drm_i915_private *i915)
+{
+	i915->quirks |= QUIRK_RENDERCLEAR_REDUCED;
+	drm_info(&i915->drm, "Applying Renderclear Reduced quirk\n");
+}
+
 struct intel_quirk {
 	int device;
 	int subsystem_vendor;
@@ -141,6 +151,9 @@ static struct intel_quirk intel_quirks[] = {
 	/* HP Chromebook 14 (Celeron 2955U) */
 	{ 0x0a06, 0x103c, 0x21ed, quirk_backlight_present },
 
+	/* HP Mini 300-020 */
+	{ 0x0a06, 0x103c, 0x2b38, quirk_renderclear_reduced },
+
 	/* Dell Chromebook 11 */
 	{ 0x0a06, 0x1028, 0x0a35, quirk_backlight_present },
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 3/3] drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED is set.
  2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
  2020-11-01 17:41 ` [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED rwright
  2020-11-01 17:41 ` [PATCH v3 2/3] drm/i915/display: Add function quirk_renderclear_reduced rwright
@ 2020-11-01 17:41 ` rwright
  2020-11-02  9:48 ` [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang Hans de Goede
  3 siblings, 0 replies; 8+ messages in thread
From: rwright @ 2020-11-01 17:41 UTC (permalink / raw)
  To: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, hdegoede, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, rwright
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

From: Randy Wright <rwright@hpe.com>

In function batch_get_defaults, the smaller batch size
will be selected if QUIRK_RENDERCLEAR_REDUCED is set.

Signed-off-by: Randy Wright <rwright@hpe.com>
---
 drivers/gpu/drm/i915/gt/gen7_renderclear.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/gen7_renderclear.c b/drivers/gpu/drm/i915/gt/gen7_renderclear.c
index d93d85cd3027..e5265cdf696b 100644
--- a/drivers/gpu/drm/i915/gt/gen7_renderclear.c
+++ b/drivers/gpu/drm/i915/gt/gen7_renderclear.c
@@ -49,7 +49,7 @@ struct batch_vals {
 static void
 batch_get_defaults(struct drm_i915_private *i915, struct batch_vals *bv)
 {
-	if (IS_HASWELL(i915)) {
+	if (IS_HASWELL(i915) && !(i915->quirks & QUIRK_RENDERCLEAR_REDUCED)) {
 		bv->max_primitives = 280;
 		bv->max_urb_entries = MAX_URB_ENTRIES;
 		bv->surface_height = 16 * 16;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
  2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
                   ` (2 preceding siblings ...)
  2020-11-01 17:41 ` [PATCH v3 3/3] drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED is set rwright
@ 2020-11-02  9:48 ` Hans de Goede
  2020-11-02 19:57   ` rwright
  3 siblings, 1 reply; 8+ messages in thread
From: Hans de Goede @ 2020-11-02  9:48 UTC (permalink / raw)
  To: rwright, jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied,
	daniel, sumit.semwal, christian.koenig, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

Hi,

On 11/1/20 6:41 PM, rwright@hpe.com wrote:
> From: Randy Wright <rwright@hpe.com>
> 
> For several months, I've been experiencing GPU hangs when  starting
> Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
> kernel.  I reported this recently in
> https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
> attached the requested evidence including the state collected from
> /sys/class/drm/card0/error and debug output from dmesg.
> 
> I ran a bisect to find the problem, which indicates this is the
> troublesome commit:
> 
>   [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts
> 
> The nature of that commit suggested to me that reducing the
> batch size used in the context clear operation might help this
> relatively low-powered system to avoid the hang.... and it did!
> I simply forced this system to take the smaller batch length that is
> already used for non-Haswell systems.
> 
> The first two versions of this patch were posted as RFC
> patches to the Intel-gfx list, implementing the same
> algorithmic change in function batch_get_defaults,
> but without employing a properly constructed quirk.
> 
> I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
> The quirk is presently set only for the aforementioned HP Pavilion Mini
> 300-020.  The patch now touches three files to define the quirk, set it,
> and then check for it in function batch_get_defaults.

Note I'm not really an i915 dev.

With that said I do wonder if we should not use the
reduced batch size in a lot more cases, the machine in question uses a
3558U CPU if the iGPU of that CPU has this issue, then I would expect
pretty much all Haswell U models (at a minimum) to have this issue.

So solving this with a quirk for just the HP Pavilion Mini 300-020
seems wrong to me. I think we need a more generic way of enabling
the reduced batch size. I even wonder if we should not simply use
it everywhere. Since you do have a proper Haswell CPU, I guess
it being an U model makes the hang easier to trigger, but I suspect
the higher TPD ones may also still be susceptible ...

Regards,

Hans


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
  2020-11-02  9:48 ` [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang Hans de Goede
@ 2020-11-02 19:57   ` rwright
  2020-11-07 17:57     ` rwright
  0 siblings, 1 reply; 8+ messages in thread
From: rwright @ 2020-11-02 19:57 UTC (permalink / raw)
  To: Hans de Goede
  Cc: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, intel-gfx, dri-devel, linux-kernel, linux-media

On Mon, Nov 02, 2020 at 10:48:54AM +0100, Hans de Goede wrote:
> Hi,
> 
> On 11/1/20 6:41 PM, rwright@hpe.com wrote:
> > From: Randy Wright <rwright@hpe.com>
> > 
> > For several months, I've been experiencing GPU hangs when  starting
> > Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
> > kernel.  I reported this recently in
> > https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
> > attached the requested evidence including the state collected from
> > /sys/class/drm/card0/error and debug output from dmesg.
> > 
> > I ran a bisect to find the problem, which indicates this is the
> > troublesome commit:
> > 
> >   [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts
> > ...
> > I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
> > The quirk is presently set only for the aforementioned HP Pavilion Mini
> > 300-020.  The patch now touches three files to define the quirk, set it,
> > and then check for it in function batch_get_defaults.
> 
> Note I'm not really an i915 dev.
> 
> With that said I do wonder if we should not use the
> reduced batch size in a lot more cases, the machine in question uses a
> 3558U CPU if the iGPU of that CPU has this issue, then I would expect
> pretty much all Haswell U models (at a minimum) to have this issue.
> 
> So solving this with a quirk for just the HP Pavilion Mini 300-020
> seems wrong to me. I think we need a more generic way of enabling
> the reduced batch size. I even wonder if we should not simply use
> it everywhere. Since you do have a proper Haswell CPU, I guess
> it being an U model makes the hang easier to trigger, but I suspect
> the higher TPD ones may also still be susceptible ...
> 
> Regards,
> 
> Hans
> 

Hi Hans,

As you noted, the 3558U cpu is one of the least powerful processors
to be designated as a Haswell, but there are others at the low end
of the Haswell architecture that I also suspect might exhibit
similar problems.

That leads me to think that more gpu hangs like mine will be reported
when commit 47f8253d makes its way into widely used kernels. And that's
why I chose to implement a quirk that would allow enrolling other
systems as they are identified.

Your remark about applying the reduced batch size in all cases certainly
would simplify the patch.   However, I don't have any other systems
using the i915 driver on which I could try to measure the putative
performance penalty of reducing the batch size on a system that worked
properly with the large size.   So I couldn't thoroughly investigate
the consequences of a broader change.

That said, if the i915 maintainers respond in favor of the simpler
unconditional reduction of the batch size, I will be glad to
propose a much simpler version of my patch.

I probably should clarify that this patch is to resolve a problem on a
personally owned system that I use at home.  It is not related to a
problem with any of HPE's products, and so I don't have a lab full of
systems using the i915 driver on which I can test a change that would
have an effect many products.  The consumer products like Pavilions
stayed with HP when HPE split from HP five years ago.

--
Randy Wright            Usmail: Hewlett Packard Enterprise
Email: rwright@hpe.com          Servers Linux Enablement
Phone: (970) 898-0998           3404 E. Harmony Rd, Mailstop 36
                                Fort Collins, CO 80528-9599 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
  2020-11-02 19:57   ` rwright
@ 2020-11-07 17:57     ` rwright
  0 siblings, 0 replies; 8+ messages in thread
From: rwright @ 2020-11-07 17:57 UTC (permalink / raw)
  To: Hans de Goede
  Cc: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, intel-gfx, dri-devel, linux-kernel, linux-media

On Mon, Nov 02, 2020 at 12:57:10PM -0700, rwright@hpe.com wrote:
> On Mon, Nov 02, 2020 at 10:48:54AM +0100, Hans de Goede wrote:
> > Hi,
> > 
> ...
> That said, if the i915 maintainers respond in favor of the simpler
> unconditional reduction of the batch size, I will be glad to
> propose a much simpler version of my patch.
> ...

I received a suggestion from Mika Kuoppala to test 
https://patchwork.freedesktop.org/patch/399174/?series=83531&rev=1 as a
solution for the GPU hang I observed, and the test was successful.
I recommend this patch as a better approch than my own, as it 
addresses more general cases without introducing a new quirk.

--
Randy Wright            Usmail: Hewlett Packard Enterprise
Email: rwright@hpe.com          Servers Linux Enablement
Phone: (970) 898-0998           3404 E. Harmony Rd, Mailstop 36
                                Fort Collins, CO 80528-9599 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
@ 2020-11-01 14:42 rwright
  0 siblings, 0 replies; 8+ messages in thread
From: rwright @ 2020-11-01 14:42 UTC (permalink / raw)
  To: jani.nikula, joonas.lahtinen, rodrigo.vivi, airlied, daniel,
	sumit.semwal, christian.koenig, hdegoede, wambui.karugax, chris,
	matthew.auld, akeem.g.abodunrin, prathap.kumar.valsan,
	mika.kuoppala, rwright
  Cc: intel-gfx, dri-devel, linux-kernel, linux-media

From: Randy Wright <rwright@hpe.com>

For several months, I've been experiencing GPU hangs when  starting
Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
kernel.  I reported this recently in
https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
attached the requested evidence including the state collected from
/sys/class/drm/card0/error and debug output from dmesg.

I ran a bisect to find the problem, which indicates this is the
troublesome commit:

  [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts

The nature of that commit suggested to me that reducing the
batch size used in the context clear operation might help this
relatively low-powered system to avoid the hang.... and it did!
I simply forced this system to take the smaller batch length that is
already used for non-Haswell systems.

The first two versions of this patch were posted as RFC
patches to the Intel-gfx list, implementing the same
algorithmic change in function batch_get_defaults,
but without employing a properly constructed quirk.

I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
The quirk is presently set only for the aforementioned HP Pavilion Mini
300-020.  The patch now touches three files to define the quirk, set it,
and then check for it in function batch_get_defaults.

Randy Wright (3):
  drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED
  drm/i915/display: Add function quirk_renderclear_reduced
  drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED
    is set.

 drivers/gpu/drm/i915/display/intel_quirks.c | 13 +++++++++++++
 drivers/gpu/drm/i915/gt/gen7_renderclear.c  |  2 +-
 drivers/gpu/drm/i915/i915_drv.h             |  1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-11-07 17:58 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
2020-11-01 17:41 ` [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED rwright
2020-11-01 17:41 ` [PATCH v3 2/3] drm/i915/display: Add function quirk_renderclear_reduced rwright
2020-11-01 17:41 ` [PATCH v3 3/3] drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED is set rwright
2020-11-02  9:48 ` [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang Hans de Goede
2020-11-02 19:57   ` rwright
2020-11-07 17:57     ` rwright
  -- strict thread matches above, loose matches on Subject: below --
2020-11-01 14:42 rwright

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).