linux-kernel.vger.kernel.org archive mirror
* alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
@ 2022-10-13 20:33 Hans de Goede
  2022-10-15 14:25 ` Hans de Goede
  2022-10-17  8:30 ` Jani Nikula
  0 siblings, 2 replies; 13+ messages in thread
From: Hans de Goede @ 2022-10-13 20:33 UTC (permalink / raw)
  To: intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

Hi All,

Yesterday I got a new Lenovo ThinkPad X1 Yoga Gen 7 laptop. Since I plan
to make this my new day-to-day laptop, I have copied over the entire
rootfs, /home, etc. from my current laptop to avoid having to tweak
everything to my liking again.

This meant I had an initramfs generated for the other laptop, which should
be fine since both are Intel machines and the old 5.19.y initramfs-es
worked fine. But 6.0.0 crashed with what seems like random memory
corruption (list integrity checks failing) until I regenerated the initrd ...

Comparing the old vs regenerated initrds showed no relevant differences,
which made me think this is a CPU ucode issue (the microcode is prefixed
to the initrd for early microcode loading).
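
As a rough way to check what is prefixed to a given initrd, here is a
minimal Python sketch. It assumes the early-microcode segment is an
uncompressed "newc" cpio archive at the very start of the image, with the
Intel blob stored as kernel/x86/microcode/GenuineIntel.bin (the layout the
kernel's early loader looks for). The function names are made up for
illustration and are not part of any existing tool.

```python
NEWC_MAGIC = b"070701"
HEADER_LEN = 110  # 6-byte magic followed by 13 fields of 8 hex digits each
INTEL_UCODE_PATH = "kernel/x86/microcode/GenuineIntel.bin"


def _align4(offset):
    """Round offset up to the next multiple of 4 (newc padding rule)."""
    return (offset + 3) & ~3


def list_leading_cpio_names(image):
    """Return member names of the uncompressed newc cpio at the start of image.

    Returns an empty list if the image does not start with a newc header,
    for example when it starts directly with a compressed initramfs.
    """
    names = []
    pos = 0
    while pos + HEADER_LEN <= len(image) and image[pos:pos + 6] == NEWC_MAGIC:
        fields = [
            int(image[pos + 6 + 8 * i:pos + 14 + 8 * i].decode("ascii"), 16)
            for i in range(13)
        ]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + HEADER_LEN
        # namesize counts the trailing NUL byte, so drop it when decoding.
        name = image[name_start:name_start + namesize - 1].decode("ascii", "replace")
        if name == "TRAILER!!!":
            break
        names.append(name)
        data_start = _align4(name_start + namesize)
        pos = _align4(data_start + filesize)
    return names


def has_intel_microcode(image):
    """True if the leading cpio segment contains the Intel microcode path."""
    return INTEL_UCODE_PATH in list_leading_cpio_names(image)
```

Reading an image with open(path, "rb").read() and passing the bytes to
has_intel_microcode() would show whether a microcode segment is present;
it says nothing about whether the blob inside matches the CPU.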

After some tests I have the following observations with 6.0.0:

1. The least stable is the old initrd (so with the wrong
ucode prefixed); this crashes before ever reaching gdm.
I believe that this is caused by late microcode loading
kicking in in this case (I thought that was being removed?)
and doing late microcode loading on the i7-1260P with its
mix of P + E cores seems to seriously mess things up.

2. Slightly more stable, lasting at least a few minutes
before crashing, is using dis_ucode_ldr.

3. Using nomodeset seems to stabilize things even with
the old initrd with the wrong microcode prefixed.

4. 5.19, with an old initrd and with normal modesetting
enabled, works fine, so in a way this is a 6.0.0 regression.

5. Using 6.0 with the new initrd with the new microcode
seems mostly stable, although sometimes this seems to
hang very early during boot, esp. if a previous boot
crashed, and I have not run this for a long time yet.

6. After crashes it seems to be necessary to power-cycle
the machine to get things back in working condition.


With 6.0 the following WARN triggers:
drivers/gpu/drm/i915/display/intel_bios.c:477:

        drm_WARN(&i915->drm, min_size == 0,
                 "Block %d min_size is zero\n", section_id);

Since nomodeset helps, this might be quite relevant. In 5.19.13
this does not happen, but I'm not sure if 5.19 has this check
at all.


There is a 2022/10/07 BIOS update available from Lenovo which includes
a CPU microcode update. I have not applied this yet in case people want
to investigate this further first.

Regards,

Hans




* Re: alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-13 20:33 alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related Hans de Goede
@ 2022-10-15 14:25 ` Hans de Goede
  2022-10-17  8:17   ` [Intel-gfx] " Tvrtko Ursulin
  2022-10-17  8:30 ` Jani Nikula
  1 sibling, 1 reply; 13+ messages in thread
From: Hans de Goede @ 2022-10-15 14:25 UTC (permalink / raw)
  To: intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

Hi,

A quick update on this: the microcode being in the initrd or not
seems to be a bit of a red herring. Yesterday the machine crashed
twice at boot with 6.0.0 with an initrd which did correctly have
the alderlake microcode cpio archive prefixed.

Whereas with 5.19 it boots correctly every time. I will try to
make some time to git bisect this sometime next week. I expect
this is an i915 issue though, since 6.0.0 with nomodeset on
the cmdline does seem to boot successfully every time.
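
Since the crashes reported here are intermittent, a single clean boot is
weak evidence that a commit is good. The sketch below shows the bisection
logic with several boot attempts per step. It is not how git bisect is
implemented; first_bad, boots_ok and tries are hypothetical names, and
boots_ok stands in for whatever manual or scripted boot test is used.

```python
def first_bad(commits, boots_ok, tries=5):
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is known good and commits[-1] is known bad.
    boots_ok(commit) performs one boot test and returns True on success.
    A commit is treated as bad as soon as any of `tries` boots fails.
    """
    def is_bad(commit):
        return any(not boots_ok(commit) for _ in range(tries))

    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]
```

The value of tries is a trade-off: a higher number costs more reboots per
step, while a lower number raises the risk of marking a bad commit as good
and steering the search to the wrong half.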

Regards,

Hans



* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-15 14:25 ` Hans de Goede
@ 2022-10-17  8:17   ` Tvrtko Ursulin
  0 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2022-10-17  8:17 UTC (permalink / raw)
  To: Hans de Goede, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)
  Cc: Jani Nikula, Ville Syrjälä



+ Jani and Ville for the intel_bios.c warn - no idea if that is relevant.

Hi,

On 15/10/2022 15:25, Hans de Goede wrote:
> A quick update on this: the microcode being in the initrd or not
> seems to be a bit of a red herring. Yesterday the machine crashed
> twice at boot with 6.0.0 with an initrd which did correctly have
> the alderlake microcode cpio archive prefixed.
> 
> Whereas with 5.19 it boots correctly every time. I will try to
> make some time to git bisect this sometime next week. I expect
> this is an i915 issue though, since 6.0.0 with nomodeset on
> the cmdline does seem to boot successfully every time.

Maybe try with KASAN to see if it catches something before random list 
corruption starts happening?

Regards,

Tvrtko


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-13 20:33 alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related Hans de Goede
  2022-10-15 14:25 ` Hans de Goede
@ 2022-10-17  8:30 ` Jani Nikula
  2022-10-17  8:32   ` Hans de Goede
  2022-10-17  8:39   ` Jani Nikula
  1 sibling, 2 replies; 13+ messages in thread
From: Jani Nikula @ 2022-10-17  8:30 UTC (permalink / raw)
  To: Hans de Goede, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
> With 6.0 the following WARN triggers:
> drivers/gpu/drm/i915/display/intel_bios.c:477:
>
>         drm_WARN(&i915->drm, min_size == 0,
>                  "Block %d min_size is zero\n", section_id);

What's the value of section_id that gets printed?

BR,
Jani.


-- 
Jani Nikula, Intel Open Source Graphics Center


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17  8:30 ` Jani Nikula
@ 2022-10-17  8:32   ` Hans de Goede
  2022-10-17  8:39   ` Jani Nikula
  1 sibling, 0 replies; 13+ messages in thread
From: Hans de Goede @ 2022-10-17  8:32 UTC (permalink / raw)
  To: Jani Nikula, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

Hi,

On 10/17/22 10:30, Jani Nikula wrote:
> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>> With 6.0 the following WARN triggers:
>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>
>>         drm_WARN(&i915->drm, min_size == 0,
>>                  "Block %d min_size is zero\n", section_id);
> 
> What's the value of section_id that gets printed?

It is 42.

Regards,

Hans



* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17  8:30 ` Jani Nikula
  2022-10-17  8:32   ` Hans de Goede
@ 2022-10-17  8:39   ` Jani Nikula
  2022-10-17 10:48     ` Hans de Goede
  1 sibling, 1 reply; 13+ messages in thread
From: Jani Nikula @ 2022-10-17  8:39 UTC (permalink / raw)
  To: Hans de Goede, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>> With 6.0 the following WARN triggers:
>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>
>>         drm_WARN(&i915->drm, min_size == 0,
>>                  "Block %d min_size is zero\n", section_id);
>
> What's the value of section_id that gets printed?

I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
Use hardcoded fp_timing size for generating LFP data pointers") in
v6.1-rc1.

I don't think this is the root cause for your issues, but I wonder if
you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
already too?

BR,
Jani.


[1] https://gitlab.freedesktop.org/drm/intel/-/issues/6592

-- 
Jani Nikula, Intel Open Source Graphics Center


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17  8:39   ` Jani Nikula
@ 2022-10-17 10:48     ` Hans de Goede
  2022-10-17 11:19       ` Thorsten Leemhuis
  2022-10-17 11:40       ` Jani Nikula
  0 siblings, 2 replies; 13+ messages in thread
From: Hans de Goede @ 2022-10-17 10:48 UTC (permalink / raw)
  To: Jani Nikula, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)

Hi,

On 10/17/22 10:39, Jani Nikula wrote:
> On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@linux.intel.com> wrote:
>> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>>> With 6.0 the following WARN triggers:
>>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>>
>>>         drm_WARN(&i915->drm, min_size == 0,
>>>                  "Block %d min_size is zero\n", section_id);
>>
>> What's the value of section_id that gets printed?
> 
> I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
> Use hardcoded fp_timing size for generating LFP data pointers") in
> v6.1-rc1.
> 
> I don't think this is the root cause for your issues, but I wonder if
> you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
> already too?

6.1-rc1 indeed does not trigger the drm_WARN and for now (couple of
reboots, running for 5 minutes now) it seems stable. 6.0.0 usually
crashed during boot (but not always).

Do you think it would be worthwhile to try 6.0.0 with d3a7051841f0 ?

Any other commits which I can try before I go down the bisect route ?

(I'm assuming this will also affect other users, so we really need
to fix this for 6.0.x before it starts hitting Arch + Fedora users)

Regards,

Hans



> [1] https://gitlab.freedesktop.org/drm/intel/-/issues/6592



* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 10:48     ` Hans de Goede
@ 2022-10-17 11:19       ` Thorsten Leemhuis
  2022-10-17 13:14         ` Hans de Goede
  2022-10-17 11:40       ` Jani Nikula
  1 sibling, 1 reply; 13+ messages in thread
From: Thorsten Leemhuis @ 2022-10-17 11:19 UTC (permalink / raw)
  To: Hans de Goede, Jani Nikula, intel-gfx, Linux Kernel Mailing List

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

On 17.10.22 12:48, Hans de Goede wrote:
> On 10/17/22 10:39, Jani Nikula wrote:
>> On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@linux.intel.com> wrote:
>>> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>>>> With 6.0 the following WARN triggers:
>>>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>>>
>>>>         drm_WARN(&i915->drm, min_size == 0,
>>>>                  "Block %d min_size is zero\n", section_id);
>>>
>>> What's the value of section_id that gets printed?
>>
>> I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
>> Use hardcoded fp_timing size for generating LFP data pointers") in
>> v6.1-rc1.
>>
>> I don't think this is the root cause for your issues, but I wonder if
>> you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
>> already too?
> 
> 6.1-rc1 indeed does not trigger the drm_WARN and for now (couple of
> reboots, running for 5 minutes now) it seems stable. 6.0.0 usually
> crashed during boot (but not always).
> 
> Do you think it would be worthwhile to try 6.0.0 with d3a7051841f0 ?
> 
> Any other commits which I can try before I go down the bisect route ?
> 
> (I'm assuming this will also affect other users, so we really need
> to fix this for 6.0.x

+1

> before it starts hitting Arch + Fedora users)

FWIW, I heard both openSUSE Tumbleweed and Arch switched to 6.0.y in the
past few days already.

Ciao, Thorsten


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 10:48     ` Hans de Goede
  2022-10-17 11:19       ` Thorsten Leemhuis
@ 2022-10-17 11:40       ` Jani Nikula
  1 sibling, 0 replies; 13+ messages in thread
From: Jani Nikula @ 2022-10-17 11:40 UTC (permalink / raw)
  To: Hans de Goede, intel-gfx, Linux Kernel Mailing List,
	Thorsten Leemhuis (regressions address)
  Cc: ville.syrjala

On Mon, 17 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
> Hi,
>
> On 10/17/22 10:39, Jani Nikula wrote:
>> On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@linux.intel.com> wrote:
>>> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>>>> With 6.0 the following WARN triggers:
>>>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>>>
>>>>         drm_WARN(&i915->drm, min_size == 0,
>>>>                  "Block %d min_size is zero\n", section_id);
>>>
>>> What's the value of section_id that gets printed?
>> 
>> I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
>> Use hardcoded fp_timing size for generating LFP data pointers") in
>> v6.1-rc1.
>> 
>> I don't think this is the root cause for your issues, but I wonder if
>> you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
>> already too?
>
> 6.1-rc1 indeed does not trigger the drm_WARN and for now (couple of
> reboots, running for 5 minutes now) it seems stable. 6.0.0 usually
> crashed during boot (but not always).
>
> Do you think it would be worthwhile to try 6.0.0 with d3a7051841f0 ?

My guess is that d3a7051841f0 is a red herring. Sure, it's a warning
splat that would be nice to get fixed in v6.0, but I doubt it has
relevance to the problems you're seeing.

Cc: Ville, your thoughts?

> Any other commits which I can try before I go down the bisect route ?

Seems pretty vague I'm afraid. I know it's painful, but likely bisect is
the fastest way to pinpoint the issue and get at the root cause.

Also, filing a bug at [1] would help us get more attention.


BR,
Jani.


[1] https://gitlab.freedesktop.org/drm/intel/issues/new


>
> (I'm assuming this will also affect other users, so we really need
> to fix this for 6.0.x before it starts hitting Arch + Fedora users)
>
> Regards,
>
> Hans
>
>
>
>> [1] https://gitlab.freedesktop.org/drm/intel/-/issues/6592
>

-- 
Jani Nikula, Intel Open Source Graphics Center


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 11:19       ` Thorsten Leemhuis
@ 2022-10-17 13:14         ` Hans de Goede
  2022-10-17 13:35           ` Jani Nikula
  0 siblings, 1 reply; 13+ messages in thread
From: Hans de Goede @ 2022-10-17 13:14 UTC (permalink / raw)
  To: Thorsten Leemhuis, Jani Nikula, intel-gfx, Linux Kernel Mailing List

Hi,

On 10/17/22 13:19, Thorsten Leemhuis wrote:
> CCing the regression mailing list, as it should be in the loop for all
> regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

Yes, sorry about that. I meant to Cc the regressions list, not you personally,
but the auto-completion picked the wrong address-book entry
(and I did not notice this).

> On 17.10.22 12:48, Hans de Goede wrote:
>> On 10/17/22 10:39, Jani Nikula wrote:
>>> On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@linux.intel.com> wrote:
>>>> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>>>>> With 6.0 the following WARN triggers:
>>>>> drivers/gpu/drm/i915/display/intel_bios.c:477:
>>>>>
>>>>>         drm_WARN(&i915->drm, min_size == 0,
>>>>>                  "Block %d min_size is zero\n", section_id);
>>>>
>>>> What's the value of section_id that gets printed?
>>>
>>> I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
>>> Use hardcoded fp_timing size for generating LFP data pointers") in
>>> v6.1-rc1.
>>>
>>> I don't think this is the root cause for your issues, but I wonder if
>>> you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
>>> already too?
>>
>> 6.1-rc1 indeed does not trigger the drm_WARN and for now (couple of
>> reboots, running for 5 minutes now) it seems stable. 6.0.0 usually
>> crashed during boot (but not always).
>>
>> Do you think it would be worthwhile to try 6.0.0 with d3a7051841f0 ?

So I have been trying 6.0.0 with d3a7051841f0 doing a whole bunch of
reboots + general use and that seems stable, then I reverted it and
the very first boot of the kernel with that broke again, so I'm
pretty sure that d3a7051841f0 fixes things.

So d3a7051841f0 seems to do more than just fix the WARN().

So let's try to get d3a7051841f0 added to the official stable series
ASAP (I just noticed that Mark Pearson from Lenovo has already added it
to Fedora's 6.0.2 build).

Regards,

Hans



* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 13:14         ` Hans de Goede
@ 2022-10-17 13:35           ` Jani Nikula
  2022-10-17 14:32             ` Hans de Goede
  0 siblings, 1 reply; 13+ messages in thread
From: Jani Nikula @ 2022-10-17 13:35 UTC (permalink / raw)
  To: Hans de Goede, Thorsten Leemhuis, intel-gfx, Linux Kernel Mailing List

On Mon, 17 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
> So I have been trying 6.0.0 with d3a7051841f0 doing a whole bunch of
> reboots + general use and that seems stable, then I reverted it and
> the very first boot of the kernel with that broke again, so I'm
> pretty sure that d3a7051841f0 fixes things.
>
> So d3a7051841f0 seems to do more than just fix the WARN().

Wow, so I guess we do screw up the parsing royally then. :o

> So let's try to get d3a7051841f0 added to the official stable series
> ASAP (I just noticed that Mark Pearson from Lenovo has already added it
> to Fedora's 6.0.2 build).

I think I'd also pick d3a7051841f0^ i.e. both commits:

d3a7051841f0 ("drm/i915/bios: Use hardcoded fp_timing size for generating LFP data pointers")
4e78d6023c15 ("drm/i915/bios: Validate fp_timing terminator presence")

for stable.

BR,
Jani.


>
> Regards,
>
> Hans
>

-- 
Jani Nikula, Intel Open Source Graphics Center


* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 13:35           ` Jani Nikula
@ 2022-10-17 14:32             ` Hans de Goede
  2022-10-18 10:32               ` Ville Syrjälä
  0 siblings, 1 reply; 13+ messages in thread
From: Hans de Goede @ 2022-10-17 14:32 UTC (permalink / raw)
  To: Jani Nikula, Thorsten Leemhuis, intel-gfx, Linux Kernel Mailing List

Hi,

On 10/17/22 15:35, Jani Nikula wrote:
> On Mon, 17 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
>> So I have been trying 6.0.0 with d3a7051841f0 doing a whole bunch of
>> reboots + general use and that seems stable, then I reverted it and
>> the very first boot of the kernel with that broke again, so I'm
>> pretty sure that d3a7051841f0 fixes things.
>>
>> So d3a7051841f0 seems to do more than just fix the WARN().
> 
> Wow, so I guess we do screw up the parsing royally then. :o

I'm running the kernel with lockdep + list-debugging enabled and
I could not reproduce this (not easily at least) on a standard
Fedora 6.0.0 build without that. So maybe the parsing just manages
to write out of bounds a tiny bit, which just happens to hit a list_head
somewhere ... ?
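
For context on why a stray write would show up as a list integrity
failure, the sketch below models, in Python and under loose assumptions,
the kind of consistency check that list debugging performs before linking
a new entry: the two neighbours must still point at each other. Node,
list_add and list_add_valid are illustrative names in a toy model, not
kernel code.

```python
class Node:
    """Toy stand-in for a list_head: a node in a circular doubly linked list."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add_valid(new, prev, nxt):
    """Check that prev and nxt are still consistent neighbours."""
    return (
        nxt.prev is prev
        and prev.next is nxt
        and new is not prev
        and new is not nxt
    )


def list_add(new, prev, nxt):
    """Insert new between prev and nxt, refusing if the list looks corrupted."""
    if not list_add_valid(new, prev, nxt):
        raise RuntimeError("list_add corruption detected")
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new
```

In this model, overwriting a single prev or next pointer is enough for the
next insertion next to that node to fail the check. Without such checks
the same overwrite would go unnoticed until something dereferences the bad
pointer, which would be consistent with the problem being harder to
reproduce on a build without list debugging.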

Either way things look stable with d3a7051841f0 and it turns out
that Fedora already had that cherry-picked downstream in the
5.19.13 kernel which was stable for me too.

>> So let's try to get d3a7051841f0 added to the official stable series
>> ASAP (I just noticed that Mark Pearson from Lenovo has already added it
>> to Fedora's 6.0.2 build).
> 
> I think I'd also pick d3a7051841f0^ i.e. both commits:
> 
> d3a7051841f0 ("drm/i915/bios: Use hardcoded fp_timing size for generating LFP data pointers")
> 4e78d6023c15 ("drm/i915/bios: Validate fp_timing terminator presence")
> 
> for stable.

That sounds good, can you take care of submitting these to gkh ?

Regards,

Hans



* Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
  2022-10-17 14:32             ` Hans de Goede
@ 2022-10-18 10:32               ` Ville Syrjälä
  0 siblings, 0 replies; 13+ messages in thread
From: Ville Syrjälä @ 2022-10-18 10:32 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Jani Nikula, Thorsten Leemhuis, intel-gfx, Linux Kernel Mailing List

On Mon, Oct 17, 2022 at 04:32:28PM +0200, Hans de Goede wrote:
> On 10/17/22 15:35, Jani Nikula wrote:
> > On Mon, 17 Oct 2022, Hans de Goede <hdegoede@redhat.com> wrote:
> >> So I have been trying 6.0.0 with d3a7051841f0 doing a whole bunch of
> >> reboots + general use and that seems stable, then I reverted it and
> >> the very first boot of the kernel with that broke again, so I'm
> >> pretty sure that d3a7051841f0 fixes things.
> >>
> >> So d3a7051841f0 seems to do more than just fix the WARN().
> > 
> > Wow, so I guess we do screw up the parsing royally then. :o
> 
> I'm running the kernel with lockdep + list-debugging enabled and
> I could not reproduce this (not easily at least) on a standard
> Fedora 6.0.0 build without that. So maybe the parsing just manages
> to write out of bounds a tiny bit, which just happens to hit a list_head
> somewhere ... ?

We don't parse any of the LFP data stuff if we didn't manage
to generate the data ptrs. So can't really see how that would
happen. Another theory might be that something else gets
screwed up if we fail to parse anything, but can't really
think how that would lead to list corruption either.

> 
> Either way things look stable with d3a7051841f0 and it turns out
> that Fedora already had that cherry-picked downstream in the
> 5.19.13 kernel which was stable for me too.
> 
> >> So let's try to get d3a7051841f0 added to the official stable series
> >> ASAP (I just noticed that Mark Pearson from Lenovo has already added it
> >> to Fedora's 6.0.2 build).
> > 
> > I think I'd also pick d3a7051841f0^ i.e. both commits:
> > 
> > d3a7051841f0 ("drm/i915/bios: Use hardcoded fp_timing size for generating LFP data pointers")
> > 4e78d6023c15 ("drm/i915/bios: Validate fp_timing terminator presence")
> > 
> > for stable.

Ack from me.

> 
> That sounds good, can you take care of submitting these to gkh ?
> 
> Regards,
> 
> Hans

-- 
Ville Syrjälä
Intel

