intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] Regression in linux-next
@ 2023-10-05 15:58 Borah, Chaitanya Kumar
  2023-10-06 20:30 ` Wysocki, Rafael J
  2023-10-20  5:52 ` [Intel-gfx] Regression on linux-next (next-20231016) Borah, Chaitanya Kumar
  0 siblings, 2 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-05 15:58 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Rafael,


Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.

On next-20231003 [2], we are seeing the following warning:


```````````````````````````````````````````````````````````````````````````````
<4>[   14.093075] ------------[ cut here ]------------
<4>[   14.097664] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90
<4>[   14.106977] Modules linked in:
<4>[   14.110017] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc4-next-20231003-next-20231003-gc9f2baaa18b5+ #1
<4>[   14.121305] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3323.D89.2309110529 09/11/2023
<4>[   14.134478] RIP: 0010:for_each_thermal_trip+0x83/0x90
<4>[   14.139496] Code: 5c 41 5d c3 cc cc cc cc 5b 31 c0 5d 41 5c 41 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 21 a2 2d 00 85 c0 75 9a <0f> 0b eb 96 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90



The detailed log can be found in [3].



After bisecting the tree, the following patch [4] seems to be causing the regression.


commit d5ea889246b112e228433a5f27f57af90ca0c1fb
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Sep 21 20:02:59 2023 +0200

    ACPI: thermal: Do not use trip indices for cooling device binding

    Rearrange the ACPI thermal driver's callback functions used for cooling
    device binding and unbinding, acpi_thermal_bind_cooling_device() and
    acpi_thermal_unbind_cooling_device(), respectively, so that they use trip
    pointers instead of trip indices which is more straightforward and allows
    the driver to become independent of the ordering of trips in the thermal
    zone structure.

    The general functionality is not expected to be changed.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>



We also verified this by moving the head of the tree to the previous commit.

Could you please check why this patch causes the regression and whether a fix can be made available soon?


[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231003/bat-mtlp-6/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231003&id=d5ea889246b112e228433a5f27f57af90ca0c1fb
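
When triaging CI logs like the one quoted above, the one-line WARNING summary already names the source file, line, and function that fired. The following is a minimal Python sketch, not part of the original report, that extracts those fields; the regular expression and the `parse_warning` helper are illustrative assumptions that cover only the line format seen in this thread.

```python
import re

# Matches the summary line that the kernel prints for a WARNING, e.g.
# "WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90"
# (pattern is an assumption based on the lines quoted in this thread)
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_warning(log_line):
    """Return the fields of a kernel WARNING summary line, or None."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are easier to compare as integers.
    for key in ("cpu", "pid", "line"):
        fields[key] = int(fields[key])
    return fields


line = ("<4>[   14.097664] WARNING: CPU: 0 PID: 1 at "
        "drivers/thermal/thermal_trip.c:18 for_each_thermal_trip+0x83/0x90")
info = parse_warning(line)
print(info["file"], info["line"], info["symbol"])
# prints: drivers/thermal/thermal_trip.c 18 for_each_thermal_trip
```

Grouping CI failures by the (file, line, symbol) triple is one way to tell whether two machines hit the same assertion or two different ones.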

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-05 15:58 [Intel-gfx] Regression in linux-next Borah, Chaitanya Kumar
@ 2023-10-06 20:30 ` Wysocki, Rafael J
  2023-10-09  5:10   ` Borah, Chaitanya Kumar
  2023-10-20  5:52 ` [Intel-gfx] Regression on linux-next (next-20231016) Borah, Chaitanya Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Wysocki, Rafael J @ 2023-10-06 20:30 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hi,

On 10/5/2023 5:58 PM, Borah, Chaitanya Kumar wrote:
>
> Hello Rafael,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team 
> in Intel.
>
> This mail is regarding a regression we are seeing in our CI runs[1] on 
> linux-next repository.
>
Thanks for the report. I think that this is a lockdep assertion failing.

If that is correct, it should be straightforward to fix.

I'll take care of this early next week.

Thanks!
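
For readers unfamiliar with the term: a lockdep assertion is a debugging check that complains when a function runs without the lock it expects its caller to hold. The complaint is a warning, so the function still does its work. The sketch below is a userspace Python analogy only, not kernel code; `TrackedLock`, `assert_held`, `Zone`, and `for_each_trip` are hypothetical names chosen to mirror the situation in the report.

```python
import threading
import warnings


class TrackedLock:
    """A mutex that remembers which thread currently owns it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._owner = None
        self._lock.release()
        return False

    def held_by_current_thread(self):
        return self._owner == threading.get_ident()


def assert_held(lock):
    """Warn, without aborting, when the caller did not take the lock."""
    if not lock.held_by_current_thread():
        warnings.warn("lock not held by caller", RuntimeWarning, stacklevel=2)


class Zone:
    """Hypothetical stand-in for a structure protected by a lock."""

    def __init__(self, trips):
        self.lock = TrackedLock()
        self.trips = list(trips)


def for_each_trip(zone, callback):
    assert_held(zone.lock)          # the locking rule, checked at run time
    for trip in zone.trips:
        callback(trip)


zone = Zone(trips=[45, 90])
visited = []
with zone.lock:
    for_each_trip(zone, visited.append)      # lock held: silent
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for_each_trip(zone, visited.append)      # lock not held: warns, still runs
print(len(caught), visited)
# prints: 1 [45, 90, 45, 90]
```

Such a warning can be resolved either by making the caller take the lock or by dropping the assertion if the rule it encodes turns out not to apply to that call path.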




* Re: [Intel-gfx] Regression in linux-next
  2023-10-06 20:30 ` Wysocki, Rafael J
@ 2023-10-09  5:10   ` Borah, Chaitanya Kumar
  2023-10-09 19:23     ` Wysocki, Rafael J
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-09  5:10 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Rafael

>Thanks for the report, I think that this is a lockdep assertion failing.
>If that is correct, it should be straightforward to fix.
>I'll take care of this early next week.
>Thanks!

Thank you for your response. Please let us know when a fix is available.

Regards

Chaitanya



* Re: [Intel-gfx] Regression in linux-next
  2023-10-09  5:10   ` Borah, Chaitanya Kumar
@ 2023-10-09 19:23     ` Wysocki, Rafael J
  2023-10-11  4:00       ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Wysocki, Rafael J @ 2023-10-09 19:23 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hi,

On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> Hello Rafael
>
>> Thanks for the report, I think that this is a lockdep assertion failing.
>> If that is correct, it should be straightforward to fix.
>> I'll take care of this early next week.
>> Thanks!
> Thank you for your response.  Please let us know when a fix is available.

It should be fixed in linux-next from today, by this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708

Thanks!




* Re: [Intel-gfx] Regression in linux-next
  2023-10-09 19:23     ` Wysocki, Rafael J
@ 2023-10-11  4:00       ` Borah, Chaitanya Kumar
  2023-10-11 16:14         ` Wysocki, Rafael J
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-11  4:00 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Sent: Tuesday, October 10, 2023 12:54 AM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael
> >
> >> Thanks for the report, I think that this is a lockdep assertion failing.
> >> If that is correct, it should be straightforward to fix.
> >> I'll take care of this early next week.
> >> Thanks!
> > Thank you for your response.  Please let us know when a fix is available.
> 
> It should be fixed in linux-next from today, by this commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> 
> Thanks!

Thanks a lot for the fix. It seems to have fixed the issue on most of the machines, but we are still seeing a similar problem on a few of them.

The new warning has a different call stack but appears to come from the same thermal subsystem. Full logs are in [1].

<4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
<4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
<4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
<4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
<4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
<4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
<4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
<4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
<4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
<4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
<4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
<4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
<4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
<4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[    4.392095] Call Trace:
<4>[    4.392097]  <TASK>
<4>[    4.392100]  ? __warn+0x7f/0x170
<4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
<4>[    4.392109]  ? report_bug+0x1f8/0x200
<4>[    4.392116]  ? handle_bug+0x3c/0x70
<4>[    4.392119]  ? exc_invalid_op+0x18/0x70
<4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
<4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
<4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
<4>[    4.392141]  trip_point_show+0x18/0x40
<4>[    4.392145]  dev_attr_show+0x15/0x60
<4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
<4>[    4.392154]  seq_read_iter+0x111/0x450
<4>[    4.392158]  ? check_object+0x133/0x320
<4>[    4.392164]  vfs_read+0x20d/0x300
<4>[    4.392175]  ksys_read+0x64/0xe0
<4>[    4.392180]  do_syscall_64+0x3c/0x90
<4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[    4.392187] RIP: 0033:0x7f1f0e193392

Can you please check what could be the reason for this issue?

[1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt

Regards

Chaitanya
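
In an x86 kernel call trace such as the one above, entries prefixed with `?` are addresses found on the stack that the unwinder could not confirm, so the unmarked entries give the more trustworthy call path. The following Python sketch, which is not part of the original mail, filters a trace down to those entries; `FRAME_RE` and `reliable_frames` are illustrative names, and the pattern is an assumption based on the line format shown in this thread.

```python
import re

# One call-trace entry: timestamp bracket, optional "? " marker,
# then symbol+offset/size at the end of the line.
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def reliable_frames(log_lines):
    """Return the symbols of call-trace entries not marked with '?'."""
    frames = []
    for line in log_lines:
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            frames.append(match.group("symbol"))
    return frames


# A few lines copied from the trace quoted above.
trace = [
    "<4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70",
    "<4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70",
    "<4>[    4.392141]  trip_point_show+0x18/0x40",
    "<4>[    4.392145]  dev_attr_show+0x15/0x60",
    "<4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100",
    "<4>[    4.392158]  ? check_object+0x133/0x320",
    "<4>[    4.392164]  vfs_read+0x20d/0x300",
]
print(" <- ".join(reliable_frames(trace)))
# prints: trip_point_show <- dev_attr_show <- sysfs_kf_seq_show <- vfs_read
```

Read this way, the trace shows the warning being reached through a sysfs attribute read (`trip_point_show`) issued by the `thermald` process named in the `Comm:` field.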






* Re: [Intel-gfx] Regression in linux-next
  2023-10-11  4:00       ` Borah, Chaitanya Kumar
@ 2023-10-11 16:14         ` Wysocki, Rafael J
  2023-10-11 16:49           ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Wysocki, Rafael J @ 2023-10-11 16:14 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hi,

On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> Hello Rafael,
>
>> -----Original Message-----
>> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
>> Sent: Tuesday, October 10, 2023 12:54 AM
>> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
>> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
>> Subject: Re: Regression in linux-next
>>
>> Hi,
>>
>> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
>>> Hello Rafael
>>>
>>>> Thanks for the report, I think that this is a lockdep assertion failing.
>>>> If that is correct, it should be straightforward to fix.
>>>> I'll take care of this early next week.
>>>> Thanks!
>>> Thank you for your response.  Please let us know when a fix is available.
>> It should be fixed in linux-next from today, by this commit:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
>>
>> Thanks!
> Thanks a lot for the fix. This seems to have fixed the issue in most of the machines but we are still seeing a similar problem in few of the machines.

Thanks for reporting this!


> This has a different call stack but seems to be from the same thermal subsystem. Full logs in [1]
>
> <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> <4>[    4.392095] Call Trace:
> <4>[    4.392097]  <TASK>
> <4>[    4.392100]  ? __warn+0x7f/0x170
> <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392109]  ? report_bug+0x1f8/0x200
> <4>[    4.392116]  ? handle_bug+0x3c/0x70
> <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> <4>[    4.392141]  trip_point_show+0x18/0x40
> <4>[    4.392145]  dev_attr_show+0x15/0x60
> <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> <4>[    4.392154]  seq_read_iter+0x111/0x450
> <4>[    4.392158]  ? check_object+0x133/0x320
> <4>[    4.392164]  vfs_read+0x20d/0x300
> <4>[    4.392175]  ksys_read+0x64/0xe0
> <4>[    4.392180]  do_syscall_64+0x3c/0x90
> <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[    4.392187] RIP: 0033:0x7f1f0e193392
>
> Can you please check what could be the reason for this issue?

Well, one more lockdep assertion that is not useful was recently added to the
thermal core, sorry about that.

This commit, which will be merged into linux-next tomorrow if all goes well,
should address this:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5

Thanks!


> [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt


* Re: [Intel-gfx] Regression in linux-next
  2023-10-11 16:14         ` Wysocki, Rafael J
@ 2023-10-11 16:49           ` Borah, Chaitanya Kumar
  2023-10-13 14:05             ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-11 16:49 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Sent: Wednesday, October 11, 2023 9:44 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression in linux-next
> 
> Hi,
> 
> On 10/11/2023 6:00 AM, Borah, Chaitanya Kumar wrote:
> > Hello Rafael,
> >
> >> -----Original Message-----
> >> From: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> >> Sent: Tuesday, October 10, 2023 12:54 AM
> >> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> >> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> >> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> >> <jani.saarinen@intel.com>
> >> Subject: Re: Regression in linux-next
> >>
> >> Hi,
> >>
> >> On 10/9/2023 7:10 AM, Borah, Chaitanya Kumar wrote:
> >>> Hello Rafael
> >>>
> >>>> Thanks for the report, I think that this is a lockdep assertion failing.
> >>>> If that is correct, it should be straightforward to fix.
> >>>> I'll take care of this early next week.
> >>>> Thanks!
> >>> Thank you for your response.  Please let us know when a fix is available.
> >> It should be fixed in linux-next from today, by this commit:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=b44444027ce7714f309e96b804b7fb088a40d708
> >>
> >> Thanks!
> > Thanks a lot for the fix. This seems to have fixed the issue in most of the
> > machines but we are still seeing a similar problem in few of the machines.
> 
> Thanks for reporting this!
> 
> 
> > This has a different call stack but seems to be from the same thermal
> > subsystem. Full logs in [1]
> >
> > <4>[    4.392015] WARNING: CPU: 1 PID: 306 at drivers/thermal/thermal_trip.c:178 thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392022] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel mei_pxp mei_hdcp wmi_bmof kvm e1000e irqbypass crct10dif_pclmul video ptp crc32_pclmul ghash_clmulni_intel i2c_i801 mei_me pps_core mei i2c_smbus wmi
> > <4>[    4.392057] CPU: 1 PID: 306 Comm: thermald Not tainted 6.6.0-rc5-next-20231010-next-20231010-gc0a6edb636cb+ #1
> > <4>[    4.392061] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
> > <4>[    4.392063] RIP: 0010:thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392066] Code: 74 0c 83 c0 01 39 c8 75 f0 b8 c3 ff ff ff 5b 5d c3 cc cc cc cc 48 8d bf f0 05 00 00 be ff ff ff ff e8 63 a4 2d 00 85 c0 75 b5 <0f> 0b eb b1 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
> > <4>[    4.392069] RSP: 0018:ffffc9000156bda8 EFLAGS: 00010246
> > <4>[    4.392073] RAX: 0000000000000000 RBX: ffff888103828ae8 RCX: 0000000000000001
> > <4>[    4.392075] RDX: 0000000080000000 RSI: ffffffff823de5ab RDI: ffffffff823fdfba
> > <4>[    4.392078] RBP: ffff888103a88800 R08: ffff888103828ae8 R09: 0000000000000001
> > <4>[    4.392080] R10: 0000000000000001 R11: ffff88811494d3c0 R12: ffff888103a88818
> > <4>[    4.392082] R13: ffff8881108bfa00 R14: ffff888103794408 R15: 0000000000000001
> > <4>[    4.392084] FS:  00007f1f0d6d28c0(0000) GS:ffff88822e680000(0000) knlGS:0000000000000000
> > <4>[    4.392087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[    4.392089] CR2: 000055857c50b750 CR3: 0000000111efa005 CR4: 00000000003706f0
> > <4>[    4.392091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > <4>[    4.392093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > <4>[    4.392095] Call Trace:
> > <4>[    4.392097]  <TASK>
> > <4>[    4.392100]  ? __warn+0x7f/0x170
> > <4>[    4.392104]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392109]  ? report_bug+0x1f8/0x200
> > <4>[    4.392116]  ? handle_bug+0x3c/0x70
> > <4>[    4.392119]  ? exc_invalid_op+0x18/0x70
> > <4>[    4.392123]  ? asm_exc_invalid_op+0x1a/0x20
> > <4>[    4.392133]  ? thermal_zone_trip_id+0x61/0x70
> > <4>[    4.392137]  ? thermal_zone_trip_id+0x5d/0x70
> > <4>[    4.392141]  trip_point_show+0x18/0x40
> > <4>[    4.392145]  dev_attr_show+0x15/0x60
> > <4>[    4.392149]  sysfs_kf_seq_show+0xb5/0x100
> > <4>[    4.392154]  seq_read_iter+0x111/0x450
> > <4>[    4.392158]  ? check_object+0x133/0x320
> > <4>[    4.392164]  vfs_read+0x20d/0x300
> > <4>[    4.392175]  ksys_read+0x64/0xe0
> > <4>[    4.392180]  do_syscall_64+0x3c/0x90
> > <4>[    4.392183]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[    4.392187] RIP: 0033:0x7f1f0e193392
> >
> > Can you please check what could be the reason for this issue?
> 
> Well, one more unuseful lockdep assertion has been added recently to the
> thermal core, sorry about that.
> 
> This commit
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=108ffd12be24ba1d74b3314df8db32a0a6d55ba5
> 
> that will be merged into linux-next tomorrow if all goes well, should address
> this.

Thank you for the fix. We will wait for it to get merged in linux-next.

Regards

Chaitanya

> 
> Thanks!
> 
> 
> > [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
> >
> > Regards
> >
> > Chaitanya
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression in linux-next
  2023-10-11 16:49           ` Borah, Chaitanya Kumar
@ 2023-10-13 14:05             ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-13 14:05 UTC (permalink / raw)
  To: Wysocki, Rafael J; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Rafael,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 11, 2023 10:19 PM
> To: Wysocki, Rafael J <rafael.j.wysocki@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <Suresh.Kumar.Kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: RE: Regression in linux-next
> 
> Hello Rafael,
> 
> 
> Thank you for the fix. We will wait for it to get merged in linux-next.
> 

Happy to let you know that we did not see these issues in the latest linux-next run.

Thanks a lot for your quick resolution.

Regards

Chaitanya

> Regards
> 
> Chaitanya
> 
> >
> > Thanks!
> >
> >
> > > [1] https://intel-gfx-ci.01.org/tree/linux-next/next-20231010/fi-kbl-guc/boot0.txt
> > >
> > > Regards
> > >
> > > Chaitanya
> > >
> > >
> > >
> > >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Intel-gfx] Regression on linux-next (next-20231016)
  2023-10-05 15:58 [Intel-gfx] Regression in linux-next Borah, Chaitanya Kumar
  2023-10-06 20:30 ` Wysocki, Rafael J
@ 2023-10-20  5:52 ` Borah, Chaitanya Kumar
  2023-10-20  6:38   ` Lorenzo Stoakes
  2023-10-25  6:32   ` [Intel-gfx] Regression on linux-next (next-20231013) Borah, Chaitanya Kumar
  1 sibling, 2 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-20  5:52 UTC (permalink / raw)
  To: lstoakes; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Lorenzo,

Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.

This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.

Since the version next-20231016 [2], we are seeing the following error
```````````````````````````````````````````````````````````````````````````````
<6>[    4.550196] e1000e 0000:00:1f.6 enp0s31f6: renamed from eth0
<1>[    4.581173] BUG: kernel NULL pointer dereference, address: 00000000000001b8
<1>[    4.581178] #PF: supervisor read access in kernel mode
<1>[    4.581180] #PF: error_code(0x0000) - not-present page
<6>[    4.581182] PGD 0 P4D 0 
<4>[    4.581184] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[    4.581186] CPU: 6 PID: 460 Comm: apache2 Not tainted 6.6.0-rc6-next-20231016-next-20231016-g4d0515b235de+ #1
<4>[    4.581189] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.3157.A00.2204200131 04/20/2022
<4>[    4.581193] RIP: 0010:mmap_region+0x803/0xa50
`````````````````````````````````````````````````````````````````````````````````

The detailed log can be found in [3].

After bisecting the tree, the following patch [4] seems to be causing the regression.

`````````````````````````````````````````````````````````````````````````````````````````````````````````
1db41d29b79ad271674081c752961edd064bbbac is the first bad commit
commit 1db41d29b79ad271674081c752961edd064bbbac
Author: Lorenzo Stoakes <lstoakes@gmail.com>
Date:   Thu Oct 12 18:04:30 2023 +0100

    mm: perform the mapping_map_writable() check after call_mmap()

    In order for a F_SEAL_WRITE sealed memfd mapping to have an opportunity to
    clear VM_MAYWRITE, we must be able to invoke the appropriate
    vm_ops->mmap() handler to do so.  We would otherwise fail the
    mapping_map_writable() check before we had the opportunity to avoid it.

    This patch moves this check after the call_mmap() invocation.  Only memfd
    actively denies write access causing a potential failure here (in
    memfd_add_seals()), so there should be no impact on non-memfd cases.

    This patch makes the userland-visible change that MAP_SHARED, PROT_READ
    mappings of an F_SEAL_WRITE sealed memfd mapping will now succeed.

    There is a delicate situation with cleanup paths assuming that a writable
    mapping must have occurred in circumstances where it may now not have.  In
    order to ensure we do not accidentally mark a writable file unwritable by
    mistake, we explicitly track whether we have a writable mapping and unmap
    only if we do.
`````````````````````````````````````````````````````````````````````````````````````````````````````````

We also verified that reverting the patch fixes the issue.

We didn't see the issue on next-20231018. Is there a fix already available for this? If not, could you please check why this patch causes the regression and if we can find a solution for it soon?

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231016
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231016/bat-rpls-1/boot0.txt 
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231016&id=1db41d29b79ad271674081c752961edd064bbbac

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression on linux-next (next-20231016)
  2023-10-20  5:52 ` [Intel-gfx] Regression on linux-next (next-20231016) Borah, Chaitanya Kumar
@ 2023-10-20  6:38   ` Lorenzo Stoakes
  2023-10-20  7:58     ` Borah, Chaitanya Kumar
  2023-10-25  6:32   ` [Intel-gfx] Regression on linux-next (next-20231013) Borah, Chaitanya Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2023-10-20  6:38 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

On Fri, 20 Oct 2023 at 06:52, Borah, Chaitanya Kumar
<chaitanya.kumar.borah@intel.com> wrote:
>
> Hello Lorenzo,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>
> This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
>

Thanks for reporting :) It is reassuring that this has been picked up
from multiple sources.

[snip]

> We didn't see the issue on next-20231018. Is there a fix already available for this? If not, could you please check why this patch causes the regression and if we can find a solution for it soon?

This is because I submitted a fix on Monday [0] which has now been
taken into the weds revision of -next which resolves this issue
altogether, so this regression -> not regression is expected and
intentional.

Apologies for the noise!

[0]:https://lore.kernel.org/all/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local/

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression on linux-next (next-20231016)
  2023-10-20  6:38   ` Lorenzo Stoakes
@ 2023-10-20  7:58     ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-20  7:58 UTC (permalink / raw)
  To: Lorenzo Stoakes; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Lorenzo,

> -----Original Message-----
> From: Lorenzo Stoakes <lstoakes@gmail.com>
> Sent: Friday, October 20, 2023 12:08 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression on linux-next (next-20231016)
> 
> On Fri, 20 Oct 2023 at 06:52, Borah, Chaitanya Kumar
> <chaitanya.kumar.borah@intel.com> wrote:
> >
> > Hello Lorenzo,
> >
> > Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> >
> 
> Thanks for reporting :) It is reassuring that this has been picked up from
> multiple sources.
> 
> [snip]
> 
> > We didn't see the issue on next-20231018. Is there a fix already available for this? If not, could you please check why this patch causes the regression and if we can find a solution for it soon?
> 
> This is because I submitted a fix on Monday [0] which has now been taken into
> the weds revision of -next which resolves this issue altogether, so this
> regression -> not regression is expected and intentional.
> 
> Apologies for the noise!
> 

No problem! Thank you for the fix and a quick response.

Regards

Chaitanya

> [0]:https://lore.kernel.org/all/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local/
> 
> Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Intel-gfx] Regression on linux-next (next-20231013)
  2023-10-20  5:52 ` [Intel-gfx] Regression on linux-next (next-20231016) Borah, Chaitanya Kumar
  2023-10-20  6:38   ` Lorenzo Stoakes
@ 2023-10-25  6:32   ` Borah, Chaitanya Kumar
  2023-10-25  7:32     ` Christian Brauner
  2023-11-09 17:00     ` [Intel-gfx] Regression on linux-next (next-20231107) Borah, Chaitanya Kumar
  1 sibling, 2 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-25  6:32 UTC (permalink / raw)
  To: brauner; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Christian,

Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.

This mail is regarding a regression we are seeing in our CI runs[1] on linux-next
repository.

Since the version next-20231013 [2], we are seeing the following RCU splat
```````````````````````````````````````````````````````````````````````````````
<3> [511.395679] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
<3> [511.395716] rcu: 	Tasks blocked on level-1 rcu_node (CPUs 0-9): P6238
<3> [511.395934] rcu: 	(detected by 16, t=65002 jiffies, g=123977, q=439 ncpus=20)
<6> [511.395944] task:i915_selftest   state:R  running task     stack:10568 pid:6238  tgid:6238  ppid:1001   flags:0x00004002
`````````````````````````````````````````````````````````````````````````````````

The detailed log can be found in [3].

After bisecting the tree, the following patch [4] seems to be the first "bad" commit

`````````````````````````````````````````````````````````````````````````````````````````````````````````
commit 3a77344f50d847d51abb8629a6f181cb21684157
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Sep 29 08:45:59 2023 +0200

    file: convert to SLAB_TYPESAFE_BY_RCU
`````````````````````````````````````````````````````````````````````````````````````````````````````````

We also verified that if we reset the tree to the parent commit the issue is not seen.

Could you please check how the commit results in the issue?

Thank you.

Regards

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231013
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@live@mman.html
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231013&id=3a77344f50d847d51abb8629a6f181cb21684157

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression on linux-next (next-20231013)
  2023-10-25  6:32   ` [Intel-gfx] Regression on linux-next (next-20231013) Borah, Chaitanya Kumar
@ 2023-10-25  7:32     ` Christian Brauner
  2023-10-25 13:44       ` Borah, Chaitanya Kumar
  2023-11-09 17:00     ` [Intel-gfx] Regression on linux-next (next-20231107) Borah, Chaitanya Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2023-10-25  7:32 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

On Wed, Oct 25, 2023 at 06:32:01AM +0000, Borah, Chaitanya Kumar wrote:
>  Hello Christian,
>  
>  Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>  
>  This mail is regarding a regression we are seeing in our CI runs[1] on linux-next
>  repository.

Any chance I can reproduce this locally?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Intel-gfx] Regression on linux-next (next-20231013)
  2023-10-25  7:32     ` Christian Brauner
@ 2023-10-25 13:44       ` Borah, Chaitanya Kumar
  2023-10-26 10:14         ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-25 13:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Christian,

> -----Original Message-----
> From: Christian Brauner <brauner@kernel.org>
> Sent: Wednesday, October 25, 2023 1:02 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: Re: Regression on linux-next (next-20231013)
> 
> On Wed, Oct 25, 2023 at 06:32:01AM +0000, Borah, Chaitanya Kumar wrote:
> >  Hello Christian,
> >
> >  Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
> >
> >  This mail is regarding a regression we are seeing in our CI runs[1] on linux-next repository.
> 
> Any chance I can reproduce this locally?

Thank you for your response.

I see that you have already floated a patch [1] to fix the issue. We will test it and get back to you ASAP.

In case you still need it:

If you happen to have a device with an Intel CPU (we are seeing it on machines as old as Gen3 [2]), you can follow the steps below.

1. Get the latest drm-tip from https://cgit.freedesktop.org/drm-tip/ and install it on the machine

2. Get IGT suite from https://gitlab.freedesktop.org/drm/igt-gpu-tools

3. Build the test suite
    You can use the instructions in the README.md file for building the suite.

    We use Ubuntu and I generally do the following:

	a) Make sure the packages listed in Dockerfile.build-debian-minimal and Dockerfile.build-debian are installed.
	b) meson build && ninja -C build

4. If everything goes fine, there should be a "build" folder created within the base folder of your repository
    Then run the test using the following command.
	
	sudo build/tests/i915_selftest --run-subtest live
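
Steps 2 to 4 above can be sketched as a small shell script. This is only an illustration under stated assumptions: the checkout directory name `igt-gpu-tools`, the `DRY_RUN` switch, and the `run` and `reproduce` helpers are hypothetical and not part of IGT, and `meson setup <builddir> <srcdir>` is used as the explicit form of `meson build` so that no `cd` is needed. Installing the drm-tip kernel (step 1) and the build dependencies (step 3a) is not covered. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch of reproduction steps 2-4. Prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
IGT_URL="https://gitlab.freedesktop.org/drm/igt-gpu-tools"
IGT_DIR="${IGT_DIR:-igt-gpu-tools}"   # hypothetical checkout directory

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 2: fetch the IGT suite.
    run git clone "$IGT_URL" "$IGT_DIR"
    # Step 3b: configure and build into "$IGT_DIR/build".
    run meson setup "$IGT_DIR/build" "$IGT_DIR"
    run ninja -C "$IGT_DIR/build"
    # Step 4: run the live selftests (needs root and an i915 device).
    run sudo "$IGT_DIR/build/tests/i915_selftest" --run-subtest live
}

reproduce
```

Keeping dry-run as the default makes it possible to review the exact commands before anything touches the machine.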

Regards

Chaitanya


[1] https://lore.kernel.org/intel-gfx/20231025-formfrage-watscheln-84526cd3bd7d@brauner/
[2] http://gfx-ci.igk.intel.com/tree/linux-next/igt@i915_selftest@live@mman.html



* Re: [Intel-gfx] Regression on linux-next (next-20231013)
  2023-10-25 13:44       ` Borah, Chaitanya Kumar
@ 2023-10-26 10:14         ` Borah, Chaitanya Kumar
  2023-10-26 12:16           ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-10-26 10:14 UTC (permalink / raw)
  To: Christian Brauner; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Christian,

> -----Original Message-----
> From: Borah, Chaitanya Kumar
> Sent: Wednesday, October 25, 2023 7:15 PM
> To: Christian Brauner <brauner@kernel.org>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <Suresh.Kumar.Kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> Subject: RE: Regression on linux-next (next-20231013)
> 
> Hello Christian,
> 
> > -----Original Message-----
> > From: Christian Brauner <brauner@kernel.org>
> > Sent: Wednesday, October 25, 2023 1:02 PM
> > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > <jani.saarinen@intel.com>
> > Subject: Re: Regression on linux-next (next-20231013)
> >
> > On Wed, Oct 25, 2023 at 06:32:01AM +0000, Borah, Chaitanya Kumar wrote:
> > >  Hello Christian,
> > >
> > >  Hope you are doing well. I am Chaitanya from the linux graphics
> > > team in
> > Intel.
> > >
> > >  This mail is regarding a regression we are seeing in our CI runs[1]
> > > on linux-next  repository.
> >
> > Any chance I can reproduce this locally?
> 
> Thank you for your response.
> 
> I see that you have already floated a patch [1] to fix the issue. We will test it
> and get back to you ASAP.

The solution is working for us.

Also, linux-next turned green.

http://gfx-ci.igk.intel.com/tree/linux-next/igt@i915_selftest@live@mman.html

Thank you.

Regards

Chaitanya

> [1] https://lore.kernel.org/intel-gfx/20231025-formfrage-watscheln-84526cd3bd7d@brauner/
> [2] http://gfx-ci.igk.intel.com/tree/linux-next/igt@i915_selftest@live@mman.html



* Re: [Intel-gfx] Regression on linux-next (next-20231013)
  2023-10-26 10:14         ` Borah, Chaitanya Kumar
@ 2023-10-26 12:16           ` Christian Brauner
  0 siblings, 0 replies; 26+ messages in thread
From: Christian Brauner @ 2023-10-26 12:16 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

On Thu, Oct 26, 2023 at 10:14:23AM +0000, Borah, Chaitanya Kumar wrote:
> Hello Christian,
> 
> > -----Original Message-----
> > From: Borah, Chaitanya Kumar
> > Sent: Wednesday, October 25, 2023 7:15 PM
> > To: Christian Brauner <brauner@kernel.org>
> > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > <Suresh.Kumar.Kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>
> > Subject: RE: Regression on linux-next (next-20231013)
> > 
> > Hello Christian,
> > 
> > > -----Original Message-----
> > > From: Christian Brauner <brauner@kernel.org>
> > > Sent: Wednesday, October 25, 2023 1:02 PM
> > > To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> > > Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> > > <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> > > <jani.saarinen@intel.com>
> > > Subject: Re: Regression on linux-next (next-20231013)
> > >
> > > On Wed, Oct 25, 2023 at 06:32:01AM +0000, Borah, Chaitanya Kumar wrote:
> > > >  Hello Christian,
> > > >
> > > >  Hope you are doing well. I am Chaitanya from the linux graphics
> > > > team in
> > > Intel.
> > > >
> > > >  This mail is regarding a regression we are seeing in our CI runs[1]
> > > > on linux-next  repository.
> > >
> > > Any chance I can reproduce this locally?
> > 
> > Thank you for your response.
> > 
> > I see that you have already floated a patch [1] to fix the issue. We will test it
> > and get back to you ASAP.
> 
> The solution is working for us.
> 
> Also, linux-next turned green.

Great! That already has the final version of the patch.

> http://gfx-ci.igk.intel.com/tree/linux-next/igt@i915_selftest@live@mman.html
> 
> Thank you.

Thanks for the report!


* [Intel-gfx] Regression on linux-next (next-20231107)
  2023-10-25  6:32   ` [Intel-gfx] Regression on linux-next (next-20231013) Borah, Chaitanya Kumar
  2023-10-25  7:32     ` Christian Brauner
@ 2023-11-09 17:00     ` Borah, Chaitanya Kumar
  2023-11-09 20:40       ` Krister Johansen
  2023-12-04 17:17       ` [Intel-gfx] Regression on linux-next (next-20231130) Borah, Chaitanya Kumar
  1 sibling, 2 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-11-09 17:00 UTC (permalink / raw)
  To: kjlx; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Krister,
 
Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on some machines (dg2 and adl-p) with the linux-next repository.

Since next-20231107 [2], we have been seeing the following error:
```````````````````````````````````````````````````````````````````````````````
<4>[   32.015910] stack segment: 0000 [#1] PREEMPT SMP NOPTI
<4>[   32.021048] CPU: 15 PID: 766 Comm: fusermount Not tainted 6.6.0-next-20231107-next-20231107-g5cd631a52568+ #1
<4>[   32.031135] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.4221.A00.2305271351 05/27/2023
<4>[   32.044657] RIP: 0010:fuse_evict_inode+0x61/0x150 [fuse]
`````````````````````````````````````````````````````````````````````````````````

Detailed logs can be found in [3].

After bisecting the tree, the following patch [4] seems to be the first "bad" commit:

 `````````````````````````````````````````````````````````````````````````````````````````````````````````
513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5 is the first bad commit
commit 513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5
Author: Krister Johansen kjlx@templeofstupid.com
Date:   Fri Nov 3 10:39:47 2023 -0700

    fuse: share lookup state between submount and its parent

    Fuse submounts do not perform a lookup for the nodeid that they inherit
    from their parent.  Instead, the code decrements the nlookup on the
    submount's fuse_inode when it is instantiated, and no forget is
    performed when a submount root is evicted.

    Trouble arises when the submount's parent is evicted despite the
    submount itself being in use.  In this author's case, the submount was
    in a container and detached from the initial mount namespace via a
    MNT_DETACH operation.  When memory pressure triggered the shrinker, the
    inode from the parent was evicted, which triggered enough forgets to
    render the submount's nodeid invalid.

    Since submounts should still function, even if their parent goes away,
    solve this problem by sharing refcounted state between the parent and
    its submount.  When all of the references on this shared state reach
    zero, it's safe to forget the final lookup of the fuse nodeid.

 `````````````````````````````````````````````````````````````````````````````````````````````````````````
 
We also verified that if we revert the patch the issue is not seen.
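
For reference, the bisect-and-revert check described above roughly corresponds to the following sketch. It rests on assumptions: `LAST_GOOD_TAG` is a placeholder for whichever earlier linux-next tag still passed (this mail does not name it), the revert is assumed to be applied on top of next-20231107, and the `DRY_RUN` switch and `run` helper are hypothetical. Each bisect step still needs a manual build, boot, and test before it is marked good or bad. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the bisect and revert check. Prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
BAD_TAG="next-20231107"
GOOD_TAG="${GOOD_TAG:-LAST_GOOD_TAG}"   # placeholder, not named in the report
SUSPECT="513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Start the bisection between the known-good and known-bad tags.
bisect_start() {
    run git bisect start
    run git bisect bad "$BAD_TAG"
    run git bisect good "$GOOD_TAG"
}

# Confirm the result: revert the suspect commit on the bad tag and retest.
verify_by_revert() {
    run git bisect reset
    run git checkout "$BAD_TAG"
    run git revert --no-edit "$SUSPECT"
}

bisect_start
verify_by_revert
```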

Could you please check why the patch causes this regression and provide a fix if necessary?

Thank you.

Regards

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231107
[3] http://gfx-ci.igk.intel.com/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231107&id=513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5


* Re: [Intel-gfx] Regression on linux-next (next-20231107)
  2023-11-09 17:00     ` [Intel-gfx] Regression on linux-next (next-20231107) Borah, Chaitanya Kumar
@ 2023-11-09 20:40       ` Krister Johansen
  2023-11-10  3:38         ` Borah, Chaitanya Kumar
  2023-12-04 17:17       ` [Intel-gfx] Regression on linux-next (next-20231130) Borah, Chaitanya Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Krister Johansen @ 2023-11-09 20:40 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar
  Cc: Miklos Szeredi, intel-gfx, kjlx, Kurmi, Suresh Kumar

Hi Chaitanya,

On Thu, Nov 09, 2023 at 05:00:09PM +0000, Borah, Chaitanya Kumar wrote:
> Hello Krister,
>  
> Hope you are doing well. I am Chaitanya from the linux graphics team in Intel.
>  
> This mail is regarding a regression we are seeing in our CI runs[1] for some machines (dg2 and adl-p) on linux-next  repository.
> 
> Since the version next-20231107 [2], we are seeing the following error
> ```````````````````````````````````````````````````````````````````````````````
> <4>[   32.015910] stack segment: 0000 [#1] PREEMPT SMP NOPTI
> <4>[   32.021048] CPU: 15 PID: 766 Comm: fusermount Not tainted 6.6.0-next-20231107-next-20231107-g5cd631a52568+ #1
> <4>[   32.031135] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.4221.A00.2305271351 05/27/2023
> <4>[   32.044657] RIP: 0010:fuse_evict_inode+0x61/0x150 [fuse]
> `````````````````````````````````````````````````````````````````````````````````
> 
> Details log can be found in [3].
> 
> After bisecting the tree, the following patch [4] seems to be the first "bad" commit
> 
>  `````````````````````````````````````````````````````````````````````````````````````````````````````````
> 513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5 is the first bad commit
> commit 513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5
> Author: Krister Johansen kjlx@templeofstupid.com
> Date:   Fri Nov 3 10:39:47 2023 -0700
> 
>     fuse: share lookup state between submount and its parent
> 
>     Fuse submounts do not perform a lookup for the nodeid that they inherit
>     from their parent.  Instead, the code decrements the nlookup on the
>     submount's fuse_inode when it is instantiated, and no forget is
>     performed when a submount root is evicted.
> 
>     Trouble arises when the submount's parent is evicted despite the
>     submount itself being in use.  In this author's case, the submount was
>     in a container and deatched from the initial mount namespace via a
>     MNT_DEATCH operation.  When memory pressure triggered the shrinker, the
>     inode from the parent was evicted, which triggered enough forgets to
>     render the submount's nodeid invalid.
> 
>     Since submounts should still function, even if their parent goes away,
>     solve this problem by sharing refcounted state between the parent and
>     its submount.  When all of the references on this shared state reach
>     zero, it's safe to forget the final lookup of the fuse nodeid.
> 
>  `````````````````````````````````````````````````````````````````````````````````````````````````````````
>  
> We also verified that if we revert the patch the issue is not seen.
> 
> Could you please check why the patch causes this regression and provide a fix if necessary?

Apologies for the inconvenience.  I've reproduced the problem, tested a
fix, and am in the process of preparing patches to send to Miklos.  I'll
cc the people on this e-mail in that thread.

> [3] http://gfx-ci.igk.intel.com/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt

This link didn't resolve in DNS when I tried to access it.  I needed to
use intel-gfx-ci.01.org as the hostname instead.

Thanks,

-K


* Re: [Intel-gfx] Regression on linux-next (next-20231107)
  2023-11-09 20:40       ` Krister Johansen
@ 2023-11-10  3:38         ` Borah, Chaitanya Kumar
  2023-11-13  6:21           ` Borah, Chaitanya Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-11-10  3:38 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Miklos Szeredi, intel-gfx, Kurmi, Suresh Kumar

Hello Krister,

> -----Original Message-----
> From: Krister Johansen <kjlx@templeofstupid.com>
> Sent: Friday, November 10, 2023 2:10 AM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: kjlx@templeofstupid.com; intel-gfx@lists.freedesktop.org; Kurmi, Suresh
> Kumar <suresh.kumar.kurmi@intel.com>; Saarinen, Jani
> <jani.saarinen@intel.com>; Miklos Szeredi <mszeredi@redhat.com>
> Subject: Re: Regression on linux-next (next-20231107)
> 
> Hi Chaitanya,
> 
> 
> Apologies for the inconvenience.  I've reproduced the problem, tested a fix,
> and am in the process of preparing patches to send to Miklos.  I'll cc the
> people on this e-mail in that thread.
> 
> > [3] http://gfx-ci.igk.intel.com/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt
> 
> This link didn't resolve in DNS when I tried to access it.  I needed to use
> intel-gfx-ci.01.org as the hostname instead.
> 

My bad. I realized it too late. I hope you found the logs. If not, here they are:

https://intel-gfx-ci.01.org/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt
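
The hostname swap can be sketched as a one-line rewrite. This assumes the path below the host is identical on both servers, which holds for the boot log above but is not guaranteed for every CI link; the `to_public_url` helper is hypothetical.

```shell
#!/bin/sh
# Rewrite an internal CI URL to the publicly resolvable host.
# Assumes the path after the hostname is the same on both servers.
to_public_url() {
    printf '%s\n' "$1" |
        sed 's#^http://gfx-ci\.igk\.intel\.com/#https://intel-gfx-ci.01.org/#'
}

to_public_url 'http://gfx-ci.igk.intel.com/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt'
```

URLs that do not start with the internal host are passed through unchanged.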

Regards

Chaitanya
> Thanks,
> 
> -K


* Re: [Intel-gfx] Regression on linux-next (next-20231107)
  2023-11-10  3:38         ` Borah, Chaitanya Kumar
@ 2023-11-13  6:21           ` Borah, Chaitanya Kumar
       [not found]             ` <20231114174121.GA2064@templeofstupid.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-11-13  6:21 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Miklos Szeredi, intel-gfx, Kurmi, Suresh Kumar

Hello Krister,

Any luck with this?



* Re: [Intel-gfx] Regression on linux-next (next-20231107)
       [not found]             ` <20231114174121.GA2064@templeofstupid.com>
@ 2023-11-15  4:33               ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-11-15  4:33 UTC (permalink / raw)
  To: Krister Johansen; +Cc: Miklos Szeredi, intel-gfx, Kurmi, Suresh Kumar

Hello Krister,

> -----Original Message-----
> From: Krister Johansen <kjlx@templeofstupid.com>
> Sent: Tuesday, November 14, 2023 11:11 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: Krister Johansen <kjlx@templeofstupid.com>;
> intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>;
> Miklos Szeredi <mszeredi@redhat.com>
> Subject: Re: Regression on linux-next (next-20231107)
> 
> Hi Chaitanya,
> 
> On Mon, Nov 13, 2023 at 06:21:57AM +0000, Borah, Chaitanya Kumar wrote:
> > Hello Krister,
> >
> > Any luck with this?
> >
> 
> Yes, I sent Miklos a patch for this on the 9th.  That was pulled into
> fuse/for-next.  You can either apply this patch directly:
> 
> https://lore.kernel.org/linux-fsdevel/CAJfpegtOKLDy-j=oi8BsT+xjFnO+Mk7=8VxSDuyi-bxhRSGMKQ@mail.gmail.com/T/#m1116af8fd8428f2871d527b7fc5d6351bd6f199a
> 
> Or sync with a version of linux-next that contains the fix, which should be at
> least the 11/10 branch.
> 

Thanks a lot for the fix. Issue is resolved for us now.

Regards

Chaitanya

> Thanks,
> 
> -K


* [Intel-gfx] Regression on linux-next (next-20231130)
  2023-11-09 17:00     ` [Intel-gfx] Regression on linux-next (next-20231107) Borah, Chaitanya Kumar
  2023-11-09 20:40       ` Krister Johansen
@ 2023-12-04 17:17       ` Borah, Chaitanya Kumar
  2023-12-04 18:11         ` Berg, Johannes
  2024-01-31  5:34         ` Regression on drm-tip Borah, Chaitanya Kumar
  1 sibling, 2 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-12-04 17:17 UTC (permalink / raw)
  To: Berg, Johannes; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Johannes,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the linux-next repository.

Since next-20231130 [2], we have been seeing the following regression:

 `````````````````````````````````````````````````````````````````````````````````
<4> [198.663557] ======================================================
<4> [198.663559] WARNING: possible circular locking dependency detected
<4> [198.663562] 6.7.0-rc4-next-20231204-next-20231204-g629a3b49f3f9+ #1 Not tainted
<4> [198.663566] ------------------------------------------------------
<4> [198.663568] core_hotunplug/5433 is trying to acquire lock:
<4> [198.663571] ffff8881481b5068 (debugfs:i915_lpsp_capability#7){++++}-{0:0}, at: remove_one+0x56/0x160
<4> [198.663580] 
but task is already holding lock:
<4> [198.663583] ffff88810ef2e9d0 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at: simple_recursive_removal+0x1a1/0x2e0
<4> [198.663591] 
which lock already depends on the new lock.
<4> [198.663594] 
the existing dependency chain (in reverse order) is:
 `````````````````````````````````````````````````````````````````````````````````
Detailed logs can be found in [3].

Locally, we have seen a slightly different version of the issue:

[  663.199573] core_hotunplug/1735 is trying to acquire lock:
[  663.199574] ffff888133406e68 (debugfs:i915_pipe){++++}-{0:0}, at: remove_one+0x56/0x160
 
After bisecting the tree, the following patch [4] seems to be the first "bad"
commit:

`````````````````````````````````````````````````````````````````````````````````````````````````````````
commit f4acfcd4deb158b96595250cc332901b282d15b0
Author: Johannes Berg johannes.berg@intel.com
Date:   Fri Nov 24 17:25:25 2023 +0100

    debugfs: annotate debugfs handlers vs. removal with lockdep

    When you take a lock in a debugfs handler but also try
    to remove the debugfs file under that lock, things can
    deadlock since the removal has to wait for all users
    to finish.

    Add lockdep annotations in debugfs_file_get()/_put()
    to catch such issues.

    Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
    Signed-off-by: Johannes Berg johannes.berg@intel.com

fs/debugfs/file.c     | 10 ++++++++++
fs/debugfs/inode.c    | 12 ++++++++++++
fs/debugfs/internal.h |  6 ++++++
3 files changed, 28 insertions(+)
`````````````````````````````````````````````````````````````````````````````````````````````````````````

We also verified that if we revert the patch the issue is not seen.

Could you please check why the patch causes this regression and provide a fix
if necessary?

Thank you.

Regards

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231130
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20231204/bat-dg2-9/igt@core_hotunplug@unbind-rebind.html
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231130&id=f4acfcd4deb158b96595250cc332901b282d15b0


* Re: [Intel-gfx] Regression on linux-next (next-20231130)
  2023-12-04 17:17       ` [Intel-gfx] Regression on linux-next (next-20231130) Borah, Chaitanya Kumar
@ 2023-12-04 18:11         ` Berg, Johannes
  2023-12-05  6:14           ` Borah, Chaitanya Kumar
  2024-01-31  5:34         ` Regression on drm-tip Borah, Chaitanya Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Berg, Johannes @ 2023-12-04 18:11 UTC (permalink / raw)
  To: Borah, Chaitanya Kumar; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hi,

> [snip lockdep report]

> commit f4acfcd4deb158b96595250cc332901b282d15b0
> Author: Johannes Berg johannes.berg@intel.com
> Date:   Fri Nov 24 17:25:25 2023 +0100
> 
>     debugfs: annotate debugfs handlers vs. removal with lockdep

Yes, obviously, since before that there was no lockdep class "debugfs:i915_pipe" 😊

> We also verified that if we revert the patch the issue is not seen.
> 
> Could you please check why the patch causes this regression and provide a fix
> if necessary?

First off, I already sent a revert, which should be included in the next -rc. Anyway, this patch shouldn't have been included in the -rc cycle; I just erroneously included it with some bugfixes (that patch-wise had a dependency).

Secondly, we did find a false positive in another case, and yours seems to be the same or similar, due to seq_file not differentiating between the file instances.

That's a bit unfortunate, because we _did_ have actual deadlocks in wireless, with debugfs_remove() being called on a file while holding a lock that the file also acquires. Unless we differentiate seq_file instances, though, there doesn't seem to be a good way to annotate that in debugfs, as this report and others show.
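
To make the lock-ordering problem above concrete, here is a minimal sketch of how an ordering inversion between a debugfs handler and a removal path can be detected. It is a toy model only, not the kernel's lockdep implementation: the class `LockOrderGraph`, its `acquire()` method, and the lock names `debugfs:file` and `driver_mutex` are all hypothetical, and "the file being in use" is modelled as if it were a held lock.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy lock-order tracker (hypothetical; not the kernel's lockdep)."""

    def __init__(self):
        # edges[a] contains b when b was acquired while a was held.
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        """Return True if dst can be reached from src along recorded edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Record 'new taken while held'; return True if that closes a cycle."""
        inversion = self._reachable(new, held)
        self.edges[held].add(new)
        return inversion


graph = LockOrderGraph()
# Handler path: the file is in use, and the handler takes a driver lock.
handler = graph.acquire("debugfs:file", "driver_mutex")
# Removal path: the driver lock is held while removal waits for file users.
removal = graph.acquire("driver_mutex", "debugfs:file")
print(handler, removal)  # prints: False True
```

The second call reports an inversion because the two paths take the same pair in opposite order. If unrelated files end up sharing one lock class, a tracker like this would presumably report a cycle even when the two orderings can never meet on the same file, which is the kind of false positive described above.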

johannes


* Re: [Intel-gfx] Regression on linux-next (next-20231130)
  2023-12-04 18:11         ` Berg, Johannes
@ 2023-12-05  6:14           ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2023-12-05  6:14 UTC (permalink / raw)
  To: Berg, Johannes; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Johannes,

Thank you for the confirmation. We will wait for the revert to be included in linux-next.
The annotation sounds like a useful addition. Hopefully, we can find a way for both of them to co-exist.

Regards

Chaitanya


* Regression on drm-tip
  2023-12-04 17:17       ` [Intel-gfx] Regression on linux-next (next-20231130) Borah, Chaitanya Kumar
  2023-12-04 18:11         ` Berg, Johannes
@ 2024-01-31  5:34         ` Borah, Chaitanya Kumar
       [not found]           ` <b77d8588-6809-416c-b598-7a33a672c1e7@opensource.cirrus.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2024-01-31  5:34 UTC (permalink / raw)
  To: rf; +Cc: intel-gfx, Kurmi, Suresh Kumar

Hello Richard,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the drm-tip [2] repository.
The failures are captured in GitLab issues [3].

We bisected the issue and have found the following commit to be the first bad commit.
`````````````````````````````````````````````````````````````````````````````````````````````````````````
commit a0b84213f947176ddcd0e96e0751a109f28cde21
Author: Richard Fitzgerald rf@opensource.cirrus.com
Date:   Mon Dec 18 15:17:29 2023 +0000

    kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL

    suite->log must be checked for NULL before passing it to
    string_stream_clear(). This was done in kunit_init_test() but was missing
    from kunit_init_suite().

    Signed-off-by: Richard Fitzgerald rf@opensource.cirrus.com
    Fixes: 6d696c4695c5 ("kunit: add ability to run tests after boot using debugfs")
    Reviewed-by: Rae Moar rmoar@google.com
    Acked-by: David Gow davidgow@google.com
    Reviewed-by: Muhammad Usama Anjum usama.anjum@collabora.com
    Signed-off-by: Shuah Khan skhan@linuxfoundation.org

lib/kunit/test.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
`````````````````````````````````````````````````````````````````````````````````````````````````````````
We tried reverting the patch. The original issue is no longer seen, but the revert results in a NULL pointer dereference [4], which I am guessing is expected.

Could you please check why the patch causes this regression and provide a fix if necessary?
 
[1] https://intel-gfx-ci.01.org/tree/drm-tip/index.html?testfilter=drm
[2] https://cgit.freedesktop.org/drm-tip/
[3] https://gitlab.freedesktop.org/drm/intel/-/issues/10140
      https://gitlab.freedesktop.org/drm/intel/-/issues/10143
[4]
	[  179.849411] [IGT] drm_buddy: executing
	[  179.856385] [IGT] drm_buddy: starting subtest drm_buddy
	[  179.862594] KTAP version 1
	[  179.862600] 1..1
	[  179.863375] BUG: kernel NULL pointer dereference, address: 0000000000000030
	[  179.863381] #PF: supervisor read access in kernel mode
	[  179.863384] #PF: error_code(0x0000) - not-present page
	[  179.863387] PGD 0 P4D 0
	[  179.863391] Oops: 0000 [#1] PREEMPT SMP NOPTI
	[  179.863395] CPU: 1 PID: 1319 Comm: drm_buddy Not tainted 6.8.0-rc1-bisecttrail015 #16
	[  179.863398] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3471.D81.2311291340 11/29/2023
	[  179.863400] RIP: 0010:__lock_acquire+0x71f/0x2300
	[  179.863408] Code: 84 03 06 00 00 44 8b 15 27 f6 72 01 45 85 d2 0f 84 9c 00 00 00 f6 45 22 10 0f 84 63 03 00 00 41 bf 01 00 00 00 e9 8a 00 00 00 <48> 81 3f 40 d7 fa 82 41 b9 00 00 00 00 45 0f 	45 c8 83 fe 01 0f 87
	...
	[  179.863445] PKRU: 55555554
	[  179.863448] Call Trace:
	[  179.863450]  <TASK>
	[  179.863453]  ? __die_body+0x1a/0x60
	[  179.863459]  ? page_fault_oops+0x156/0x450
	[  179.863465]  ? do_user_addr_fault+0x65/0x9e0
	[  179.863472]  ? exc_page_fault+0x68/0x1a0
	[  179.863479]  ? asm_exc_page_fault+0x26/0x30
	[  179.863487]  ? __lock_acquire+0x71f/0x2300
	[  179.863493]  ? __pfx_do_sync_core+0x10/0x10
	[  179.863500]  lock_acquire+0xd8/0x2d0
	[  179.863505]  ? string_stream_clear+0x29/0xb0 [kunit]
	[  179.863523]  _raw_spin_lock+0x2e/0x40
	[  179.863528]  ? string_stream_clear+0x29/0xb0 [kunit]
	[  179.863540]  string_stream_clear+0x29/0xb0 [kunit]
	[  179.863554]  __kunit_test_suites_init+0x7e/0xe0 [kunit]
	[  179.863568]  kunit_module_notify+0x20f/0x220 [kunit]
	[  179.863583]  notifier_call_chain+0x46/0x130
	[  179.863591]  notifier_call_chain_robust+0x3e/0x90
	[  179.863598]  blocking_notifier_call_chain_robust+0x42/0x60
	[  179.863605]  load_module+0x1bcd/0x1f80
	[  179.863617]  ? init_module_from_file+0x86/0xd0
	[  179.863621]  init_module_from_file+0x86/0xd0
	[  179.863629]  idempotent_init_module+0x17c/0x230
	[  179.863637]  __x64_sys_finit_module+0x56/0xb0
	[  179.863642]  do_syscall_64+0x6f/0x140
	[  179.863649]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
	[  179.863654] RIP: 0033:0x7f0e6676195d


* RE: Regression on drm-tip
       [not found]           ` <b77d8588-6809-416c-b598-7a33a672c1e7@opensource.cirrus.com>
@ 2024-02-01  5:13             ` Borah, Chaitanya Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Borah, Chaitanya Kumar @ 2024-02-01  5:13 UTC (permalink / raw)
  To: Richard Fitzgerald
  Cc: David Gow, intel-gfx, linux-kselftest, Kurmi, Suresh Kumar, kunit-dev

> -----Original Message-----
> From: Richard Fitzgerald <rf@opensource.cirrus.com>
> Sent: Wednesday, January 31, 2024 4:05 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Kurmi, Suresh Kumar
> <suresh.kumar.kurmi@intel.com>; Saarinen, Jani <jani.saarinen@intel.com>;
> David Gow <davidgow@google.com>; kunit-dev@googlegroups.com; linux-
> kselftest@vger.kernel.org
> Subject: Re: Regression on drm-tip
> 
> Looking at the gitlab bug reports compared to the crash log above:
> 
> [3] You have hit a failure on the 3rd test case:
> 
>      <6> [59.039608] [IGT] drm_buddy: starting dynamic subtest
>      drm_test_buddy_alloc_limit
>      <6> [59.077701] KTAP version 1
>      <6> [59.077705] 1..1
>      <6> [59.078487]     KTAP version 1
>      <6> [59.078494]     # Subtest: drm_buddy
>      <6> [59.078496]     # module: drm_buddy_test
>      <6> [59.078498]     1..4
>      <6> [59.079321]     ok 1 drm_test_buddy_alloc_limit
>      <6> [59.079973]     ok 2 drm_test_buddy_alloc_optimistic
>      <6> [59.080479] [IGT] drm_buddy: finished subtest
>      drm_test_buddy_alloc_limit, SUCCESS
> 
> When you revert my NULL-dereference bugfix, you are hitting the NULL
> dereference crash immediately, before executing the test case that is causing
> [3].
> 
>      > [  179.862594] KTAP version 1
>      > [  179.862600] 1..1
>      > [  179.863375] BUG: kernel NULL pointer dereference
> 
> So, my commit is not causing your [3]. It is allowing you to reach your test
> case that is causing [3].
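
The argument above can be sketched with a small model: with the guard in place, suite initialisation succeeds and the test cases are reached; without it, initialisation fails before any test case runs, so a later failure is never observed. This is only an illustration under stated assumptions. `StringStream`, `Suite`, `init_suite_guarded()`, `init_suite_unguarded()` and `run()` are hypothetical stand-ins, not the KUnit code, and Python's `AttributeError` stands in for the NULL pointer dereference.

```python
class StringStream:
    """Hypothetical stand-in for a suite log buffer (illustration only)."""

    def __init__(self):
        self.fragments = []

    def clear(self):
        self.fragments.clear()


class Suite:
    """Hypothetical stand-in for a test suite whose log may be absent."""

    def __init__(self, log=None):
        self.log = log


def init_suite_unguarded(suite):
    # Models the reverted state: the log is used unconditionally.
    suite.log.clear()


def init_suite_guarded(suite):
    # Models the fix: skip clearing when no log was allocated.
    if suite.log is not None:
        suite.log.clear()


def run(init, suite, test_cases):
    """Initialise the suite, then run each test; return the names reached."""
    init(suite)
    reached = []
    for name, test in test_cases:
        reached.append(name)
        test()
    return reached


cases = [("case_1", lambda: None), ("case_2", lambda: None)]
print(run(init_suite_guarded, Suite(), cases))
try:
    run(init_suite_unguarded, Suite(), cases)
except AttributeError:
    print("failed during suite init, no test case reached")
```

In this model the guarded variant reaches both cases, while the unguarded variant stops during initialisation. That mirrors the ordering in the logs, where the crash appears right after the KTAP header and before any subtest output.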

Understood. I think we pulled the trigger too soon on this one.

I see that David has sent a quick patch. We will check if that helps.

Regards

Chaitanya

