From: rwright@hpe.com
To: Hans de Goede <hdegoede@redhat.com>
Cc: jani.nikula@linux.intel.com, joonas.lahtinen@linux.intel.com,
	rodrigo.vivi@intel.com, airlied@linux.ie, daniel@ffwll.ch,
	sumit.semwal@linaro.org, christian.koenig@amd.com,
	wambui.karugax@gmail.com, chris@chris-wilson.co.uk,
	matthew.auld@intel.com, akeem.g.abodunrin@intel.com,
	prathap.kumar.valsan@intel.com, mika.kuoppala@linux.intel.com,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-media@vger.kernel.org
Subject: Re: [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang
Date: Mon, 2 Nov 2020 12:57:11 -0700
Message-ID: <20201102195710.GA12790@rfwz62>
In-Reply-To: <8cdf0dd0-2a2f-bae9-71ea-89a88fdb14a5@redhat.com>

On Mon, Nov 02, 2020 at 10:48:54AM +0100, Hans de Goede wrote:
> Hi,
> 
> On 11/1/20 6:41 PM, rwright@hpe.com wrote:
> > From: Randy Wright <rwright@hpe.com>
> > 
> > For several months, I've been experiencing GPU hangs when starting
> > Cinnamon on an HP Pavilion Mini 300-020 if I try to run an upstream
> > kernel.  I reported this recently in
> > https://gitlab.freedesktop.org/drm/intel/-/issues/2413 where I have
> > attached the requested evidence including the state collected from
> > /sys/class/drm/card0/error and debug output from dmesg.
> > 
> > I ran a bisect to find the problem, which indicates this is the
> > troublesome commit:
> > 
> >   [47f8253d2b8947d79fd3196bf96c1959c0f25f20] drm/i915/gen7: Clear all EU/L3 residual contexts
> > ...
> > I've now cleaned up the patch to employ a new QUIRK_RENDERCLEAR_REDUCED.
> > The quirk is presently set only for the aforementioned HP Pavilion Mini
> > 300-020.  The patch now touches three files to define the quirk, set it,
> > and then check for it in function batch_get_defaults.
> 
> Note I'm not really an i915 dev.
> 
> With that said, I do wonder if we should not use the
> reduced batch size in a lot more cases. The machine in question uses a
> 3558U CPU; if the iGPU of that CPU has this issue, then I would expect
> pretty much all Haswell U models (at a minimum) to have this issue.
> 
> So solving this with a quirk for just the HP Pavilion Mini 300-020
> seems wrong to me. I think we need a more generic way of enabling
> the reduced batch size. I even wonder if we should not simply use
> it everywhere. Since you do have a proper Haswell CPU, I guess
> it being a U model makes the hang easier to trigger, but I suspect
> the higher TDP ones may also still be susceptible ...
> 
> Regards,
> 
> Hans
> 

Hi Hans,

As you noted, the 3558U CPU is one of the least powerful processors
to be designated as a Haswell, but there are others at the low end
of the Haswell architecture that I also suspect might exhibit
similar problems.

That leads me to think that more GPU hangs like mine will be reported
when commit 47f8253d makes its way into widely used kernels. And that's
why I chose to implement a quirk that would allow enrolling other
systems as they are identified.
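
As an illustration of the enrollment idea only (a user-space sketch, not
code from the patch series): a table maps machine identifiers to quirk
bits, and the batch-size choice checks the bit.  The ID values, the batch
sizes, and the names pci_ident, quirk_entry, lookup_quirks and
clear_batch_size are all hypothetical; only QUIRK_RENDERCLEAR_REDUCED is
a name taken from the series, and its value here is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quirk bit named after the one proposed in the series; value is assumed. */
#define QUIRK_RENDERCLEAR_REDUCED (1u << 0)

/* Placeholder batch sizes, NOT the values used by the i915 driver. */
#define BATCH_SIZE_DEFAULT 64u
#define BATCH_SIZE_REDUCED 16u

/* Hypothetical identifier for one machine model. */
struct pci_ident {
	uint16_t device;
	uint16_t subsystem_vendor;
	uint16_t subsystem_device;
};

/* One enrolled system and the quirk bits it needs. */
struct quirk_entry {
	struct pci_ident id;
	uint32_t quirk;
};

/* Placeholder IDs; further systems would be added as new rows. */
static const struct quirk_entry quirk_table[] = {
	{ { 0x1111, 0x2222, 0x3333 }, QUIRK_RENDERCLEAR_REDUCED },
};

/* Collect the quirk bits of every table row that matches the device. */
static uint32_t lookup_quirks(const struct quirk_entry *table, size_t n,
			      const struct pci_ident *dev)
{
	uint32_t quirks = 0;
	size_t i;

	assert(table != NULL && dev != NULL);

	for (i = 0; i < n; i++) {
		const struct pci_ident *id = &table[i].id;

		if (id->device == dev->device &&
		    id->subsystem_vendor == dev->subsystem_vendor &&
		    id->subsystem_device == dev->subsystem_device)
			quirks |= table[i].quirk;
	}
	return quirks;
}

/* Pick the reduced size only when the quirk bit is set. */
static unsigned int clear_batch_size(uint32_t quirks)
{
	return (quirks & QUIRK_RENDERCLEAR_REDUCED) ?
		BATCH_SIZE_REDUCED : BATCH_SIZE_DEFAULT;
}
```

With this shape, enrolling another affected system is a one-line table
addition, while unlisted systems keep the default size.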

Your remark about applying the reduced batch size in all cases certainly
would simplify the patch.  However, I don't have any other systems
using the i915 driver on which I could try to measure the putative
performance penalty of reducing the batch size on a system that worked
properly with the large size.  So I couldn't thoroughly investigate
the consequences of a broader change.

That said, if the i915 maintainers respond in favor of the simpler
unconditional reduction of the batch size, I will be glad to
propose a much simpler version of my patch.

I probably should clarify that this patch is to resolve a problem on a
personally owned system that I use at home.  It is not related to a
problem with any of HPE's products, and so I don't have a lab full of
systems using the i915 driver on which I can test a change that would
have an effect on many products.  The consumer products like Pavilions
stayed with HP when HPE split from HP five years ago.

--
Randy Wright            Usmail: Hewlett Packard Enterprise
Email: rwright@hpe.com          Servers Linux Enablement
Phone: (970) 898-0998           3404 E. Harmony Rd, Mailstop 36
                                Fort Collins, CO 80528-9599 

Thread overview: 26+ messages
2020-11-01 17:41 [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang rwright
2020-11-01 17:41 ` [Intel-gfx] " rwright
2020-11-01 17:41 ` rwright
2020-11-01 17:41 ` [PATCH v3 1/3] drm/i915: Introduce quirk QUIRK_RENDERCLEAR_REDUCED rwright
2020-11-01 17:41   ` [Intel-gfx] " rwright
2020-11-01 17:41   ` rwright
2020-11-01 17:41 ` [PATCH v3 2/3] drm/i915/display: Add function quirk_renderclear_reduced rwright
2020-11-01 17:41   ` [Intel-gfx] " rwright
2020-11-01 17:41   ` rwright
2020-11-01 17:41 ` [PATCH v3 3/3] drm/i915/gt: Force reduced batch size if new QUIRK_RENDERCLEAR_REDUCED is set rwright
2020-11-01 17:41   ` [Intel-gfx] " rwright
2020-11-01 17:41   ` rwright
2020-11-01 18:05 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Reduce context clear batch size to avoid gpu hang (rev2) Patchwork
2020-11-01 18:06 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-11-01 18:24 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2020-11-02  9:48 ` [PATCH v3 0/3] Reduce context clear batch size to avoid gpu hang Hans de Goede
2020-11-02  9:48   ` [Intel-gfx] " Hans de Goede
2020-11-02  9:48   ` Hans de Goede
2020-11-02 19:57   ` rwright [this message]
2020-11-02 19:57     ` [Intel-gfx] " rwright
2020-11-02 19:57     ` rwright
2020-11-07 17:57     ` rwright
2020-11-07 17:57       ` [Intel-gfx] " rwright
2020-11-07 17:57       ` rwright
  -- strict thread matches above, loose matches on Subject: below --
2020-11-01 14:42 rwright
2020-11-01 14:42 ` rwright
