* drm/msm: DisplayPort regressions in 6.8-rc1
@ 2024-02-13 11:42 Johan Hovold
  2024-02-13 18:00 ` Abhinav Kumar
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Johan Hovold @ 2024-02-13 11:42 UTC (permalink / raw)
  To: Rob Clark, Abhinav Kumar, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

Hi,

Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
not always show up on boot.

The logs indicate problems with the runtime PM and eDP rework that went
into 6.8-rc1:

	[    6.006236] Console: switching to colour dummy device 80x25
	[    6.007542] [drm:dpu_kms_hw_init:1048] dpu hardware revision:0x80000000
	[    6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
	[    6.007934] [drm:dp_bridge_init [msm]] *ERROR* failed to attach panel bridge: -16
	[    6.007983] msm_dpu ae01000.display-controller: [drm:msm_dp_modeset_init [msm]] *ERROR* failed to create dp bridge: -16
	[    6.008030] [drm:_dpu_kms_initialize_displayport:588] [dpu error]modeset_init failed for DP, rc = -16
	[    6.008050] [drm:_dpu_kms_setup_displays:681] [dpu error]initialize_DP failed, rc = -16
	[    6.008068] [drm:dpu_kms_hw_init:1153] [dpu error]modeset init failed: -16
	[    6.008388] msm_dpu ae01000.display-controller: [drm:msm_drm_kms_init [msm]] *ERROR* kms hw init failed: -16
	
and this can also manifest itself as a NULL-pointer dereference:

	[    7.339447] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
	
	[    7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
	[    7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
	
	[    7.769039] Call trace:
	[    7.771564]  drm_bridge_attach+0x70/0x1a8 [drm]
	[    7.776234]  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
	[    7.781782]  drm_bridge_attach+0x80/0x1a8 [drm]
	[    7.786454]  dp_bridge_init+0xa8/0x15c [msm]
	[    7.790856]  msm_dp_modeset_init+0x28/0xc4 [msm]
	[    7.795617]  _dpu_kms_drm_obj_init+0x19c/0x680 [msm]
	[    7.800731]  dpu_kms_hw_init+0x348/0x4c4 [msm]
	[    7.805306]  msm_drm_kms_init+0x84/0x324 [msm]
	[    7.809891]  msm_drm_bind+0x1d8/0x3a8 [msm]
	[    7.814196]  try_to_bring_up_aggregate_device+0x1f0/0x2f8
	[    7.819747]  __component_add+0xa4/0x18c
	[    7.823703]  component_add+0x14/0x20
	[    7.827389]  dp_display_probe+0x47c/0x568 [msm]
	[    7.832052]  platform_probe+0x68/0xd8
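
For anyone decoding the return codes in logs like the above: -16 is
-EBUSY. The following is a minimal Python sketch, not part of any kernel
tooling, that maps the trailing codes in such log lines to symbolic
names. The helper names (errno_name, decode_line) and the regular
expression are assumptions made for illustration; the only kernel fact
relied on beyond the standard errno table is that 517 is the
kernel-internal EPROBE_DEFER, which userspace errno tables do not define.

```python
import errno
import re
from typing import Optional

# Kernel-internal codes that userspace errno tables do not define.
# 517 is EPROBE_DEFER (include/linux/errno.h).
KERNEL_ONLY = {517: "EPROBE_DEFER"}

# Hypothetical pattern: matches a negative return code introduced by
# ": ", "rc = " or "returned ", as seen in the log lines in this thread.
RC_PATTERN = re.compile(r"(?:: |rc = |returned )-(\d+)\b")


def errno_name(code: int) -> str:
    """Return the symbolic name for a positive errno value."""
    if code in KERNEL_ONLY:
        return KERNEL_ONLY[code]
    return errno.errorcode.get(code, f"UNKNOWN({code})")


def decode_line(line: str) -> Optional[str]:
    """Return the symbolic error name found in a log line, or None."""
    match = RC_PATTERN.search(line)
    return errno_name(int(match.group(1))) if match else None
```

Feeding the "failed to attach panel bridge: -16" line to decode_line()
yields "EBUSY" on Linux; lines without a negative return code yield None.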

Users have also reported random crashes at boot since 6.8-rc1, and I've
been able to trigger hard crashes twice when testing an external display
(USB-C/DP), which may also be related to the DP regressions.

I've opened an issue here:

	https://gitlab.freedesktop.org/drm/msm/-/issues/51

but I also want Thorsten's help to track this so that it gets fixed
before 6.8 is released.

#regzbot introduced: v6.7..v6.8-rc1

The following series is likely the culprit:

	https://lore.kernel.org/all/1701472789-25951-1-git-send-email-quic_khsieh@quicinc.com/

Johan

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-13 11:42 drm/msm: DisplayPort regressions in 6.8-rc1 Johan Hovold
@ 2024-02-13 18:00 ` Abhinav Kumar
  2024-02-14 13:18   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-14 13:52   ` Johan Hovold
  2024-02-17 15:22 ` drm/msm: DisplayPort regressions " Johan Hovold
  2024-02-23  7:49 ` Linux regression tracking #update (Thorsten Leemhuis)
  2 siblings, 2 replies; 11+ messages in thread
From: Abhinav Kumar @ 2024-02-13 18:00 UTC (permalink / raw)
  To: Johan Hovold, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

Hi Johan

Thanks for the report.

I agree that the runtime PM rework of the eDP driver was merged in that
window, but I think the issue is caused either by that rework in
combination with the DRM aux bridge series
https://patchwork.freedesktop.org/series/122584/ or by the aux bridge
series alone, since it went in around the same time.

That is perhaps why this issue was not seen on the chromebooks we
tested, as they do not use pmic_glink (aux bridge).

So we will need to debug this on sc8280xp specifically or an equivalent 
device which uses aux bridge.

Thanks

Abhinav

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-13 18:00 ` Abhinav Kumar
@ 2024-02-14 13:18   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-14 13:52   ` Johan Hovold
  1 sibling, 0 replies; 11+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-14 13:18 UTC (permalink / raw)
  To: Abhinav Kumar, Johan Hovold, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On 13.02.24 19:00, Abhinav Kumar wrote:
> 
> Thanks for the report.
> 
> I do agree that pm runtime eDP driver got merged that time but I think
> the issue is either a combination of that along with DRM aux bridge
> https://patchwork.freedesktop.org/series/122584/ OR just the latter as
> even that went in around the same time.

In that case allow me a stupid question from the cheap seats:

Is there anything affected users can do to help us get closer to the
real problem? For example, testing a specific commit or two before or
after the merge of one of those features? That might help to rule out
a few things.

Ciao, Thorsten

> Thats why perhaps this issue was not seen with the chromebooks we tested
> on as they do not use pmic_glink (aux bridge).
> 
> So we will need to debug this on sc8280xp specifically or an equivalent
> device which uses aux bridge.

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-13 18:00 ` Abhinav Kumar
  2024-02-14 13:18   ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-14 13:52   ` Johan Hovold
  2024-02-17 15:14     ` Johan Hovold
  1 sibling, 1 reply; 11+ messages in thread
From: Johan Hovold @ 2024-02-14 13:52 UTC (permalink / raw)
  To: Abhinav Kumar
  Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Sean Paul,
	Marijn Suijten, David Airlie, Daniel Vetter, Bjorn Andersson,
	quic_jesszhan, quic_sbillaka, dri-devel, freedreno,
	linux-arm-msm, regressions, linux-kernel

On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:

> I do agree that pm runtime eDP driver got merged that time but I think 
> the issue is either a combination of that along with DRM aux bridge 
> https://patchwork.freedesktop.org/series/122584/ OR just the latter as 
> even that went in around the same time.

Yes, indeed, a lot of changes went into the MSM drm driver in 6.8-rc1,
and since I have not tried to debug this myself I can't say for sure
which change or changes triggered this regression (or possibly
regressions).

The fact that the USB-C/DP PHY appears to be involved
(/soc@0/phy@88eb000) could indeed point to the series you mentioned.

> Thats why perhaps this issue was not seen with the chromebooks we tested 
> on as they do not use pmic_glink (aux bridge).
> 
> So we will need to debug this on sc8280xp specifically or an equivalent 
> device which uses aux bridge.

I've hit the NULL-pointer dereference three times now in the last few
days on the sc8280xp CRD. But since it doesn't trigger on every boot, it
seems you need to go back to the series that could potentially have
caused this regression and review them again. There's clearly something
quite broken here.

> On 2/13/2024 3:42 AM, Johan Hovold wrote:

> > Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> > not always show up on boot.

> > 	[    6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
	
> > and this can also manifest itself as a NULL-pointer dereference:
> > 
> > 	[    7.339447] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> > 	
> > 	[    7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
> > 	[    7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
> > 	
> > 	[    7.769039] Call trace:
> > 	[    7.771564]  drm_bridge_attach+0x70/0x1a8 [drm]
> > 	[    7.776234]  drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
> > 	[    7.781782]  drm_bridge_attach+0x80/0x1a8 [drm]
> > 	[    7.786454]  dp_bridge_init+0xa8/0x15c [msm]
> > 	[    7.790856]  msm_dp_modeset_init+0x28/0xc4 [msm]
> > 	[    7.795617]  _dpu_kms_drm_obj_init+0x19c/0x680 [msm]
> > 	[    7.800731]  dpu_kms_hw_init+0x348/0x4c4 [msm]
> > 	[    7.805306]  msm_drm_kms_init+0x84/0x324 [msm]
> > 	[    7.809891]  msm_drm_bind+0x1d8/0x3a8 [msm]
> > 	[    7.814196]  try_to_bring_up_aggregate_device+0x1f0/0x2f8
> > 	[    7.819747]  __component_add+0xa4/0x18c
> > 	[    7.823703]  component_add+0x14/0x20
> > 	[    7.827389]  dp_display_probe+0x47c/0x568 [msm]
> > 	[    7.832052]  platform_probe+0x68/0xd8
> > 
> > Users have also reported random crashes at boot since 6.8-rc1, and I've
> > been able to trigger hard crashes twice when testing an external display
> > (USB-C/DP), which may also be related to the DP regressions.

Johan

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-14 13:52   ` Johan Hovold
@ 2024-02-17 15:14     ` Johan Hovold
  2024-02-19 10:41       ` drm/msm: Second DisplayPort regression " Johan Hovold
  0 siblings, 1 reply; 11+ messages in thread
From: Johan Hovold @ 2024-02-17 15:14 UTC (permalink / raw)
  To: Abhinav Kumar
  Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Sean Paul,
	Marijn Suijten, David Airlie, Daniel Vetter, Bjorn Andersson,
	quic_jesszhan, quic_sbillaka, dri-devel, freedreno,
	linux-arm-msm, regressions, linux-kernel

On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> 
> > I do agree that pm runtime eDP driver got merged that time but I think 
> > the issue is either a combination of that along with DRM aux bridge 
> > https://patchwork.freedesktop.org/series/122584/ OR just the latter as 
> > even that went in around the same time.
> 
> Yes, indeed there was a lot of changes that went into the MSM drm driver
> in 6.8-rc1 and since I have not tried to debug this myself I can't say
> for sure which change or changes that triggered this regression (or
> possibly regressions).
> 
> The fact that the USB-C/DP PHY appears to be involved
> (/soc@0/phy@88eb000) could indeed point to the series you mentioned.
> 
> > Thats why perhaps this issue was not seen with the chromebooks we tested 
> > on as they do not use pmic_glink (aux bridge).
> > 
> > So we will need to debug this on sc8280xp specifically or an equivalent 
> > device which uses aux bridge.
> 
> I've hit the NULL-pointer deference three times now in the last few days
> on the sc8280xp CRD. But since it doesn't trigger on every boot it seems
> you need to go back to the series that could potentially have caused
> this regression and review them again. There's clearly something quite
> broken here.

Since Dmitry had trouble reproducing this issue, I took a closer look at
the DRM aux bridge series that Abhinav pointed to and was able to track
down the bridge regressions and come up with a reproducer. I just posted
a series fixing this here:

	https://lore.kernel.org/lkml/20240217150228.5788-1-johan+linaro@kernel.org/

As I mentioned in the cover letter, I am still seeing intermittent hard
resets around the time that the DRM subsystem is initialising, which
suggests that we may be dealing with two separate DRM regressions here.

If the hard resets are triggered by something like unclocked hardware,
perhaps that bit could be related to the runtime PM rework?

Johan

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-13 11:42 drm/msm: DisplayPort regressions in 6.8-rc1 Johan Hovold
  2024-02-13 18:00 ` Abhinav Kumar
@ 2024-02-17 15:22 ` Johan Hovold
  2024-02-23  7:49 ` Linux regression tracking #update (Thorsten Leemhuis)
  2 siblings, 0 replies; 11+ messages in thread
From: Johan Hovold @ 2024-02-17 15:22 UTC (permalink / raw)
  To: Rob Clark, Abhinav Kumar, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On Tue, Feb 13, 2024 at 12:42:17PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> not always show up on boot.
> 
> The logs indicate problems with the runtime PM and eDP rework that went
> into 6.8-rc1:
> 
> 	[    6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
	
> and this can also manifest itself as a NULL-pointer dereference:
> 
> 	[    7.339447] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> 	
> 	[    7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]

#regzbot ^introduced: 2bcca96abfbf

It looks like it may have been possible to hit this also before commit
2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE") and
the transparent bridge rework in 6.8-rc1 even if that has not yet been
confirmed.

The above commit is what made this trigger since 6.8-rc1, however.

Johan

* drm/msm: Second DisplayPort regression in 6.8-rc1
  2024-02-17 15:14     ` Johan Hovold
@ 2024-02-19 10:41       ` Johan Hovold
  2024-02-19 13:38         ` Johan Hovold
  2024-02-20 21:19         ` Abhinav Kumar
  0 siblings, 2 replies; 11+ messages in thread
From: Johan Hovold @ 2024-02-19 10:41 UTC (permalink / raw)
  To: Abhinav Kumar, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> > On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:

> Since Dmitry had trouble reproducing this issue I took a closer look at
> the DRM aux bridge series that Abhinav pointed and was able to track
> down the bridge regressions and come up with a reproducer. I just posted
> a series fixing this here:
> 
> 	https://lore.kernel.org/lkml/20240217150228.5788-1-johan+linaro@kernel.org/
> 
> As I mentioned in the cover letter, I am still seeing intermittent hard
> resets around the time that the DRM subsystem is initialising, which
> suggests that we may be dealing with two separate DRM regressions here
> however.
> 
> If the hard resets are triggered by something like unclocked hardware,
> perhaps that bit could this be related to the runtime PM rework?

It seems my initial suspicion that at least some of these regressions
were related to the runtime PM work was correct. The hard resets happen
when the DP controller is runtime suspended after being probed:

[   16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
[   16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
[   16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[   16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
[   16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
[   16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
[  OK  ] Listening on Load/Save RF Kill Swit[   16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
[   16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
[   16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[   16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
[   16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
[   16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[   16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
         Starting Load/Save Screen Backligh…rightness[   16.959146] bus: 'platform': _ of backlight:backlight...
[   16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
[   17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
[   17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
[   17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
         Starting Network Name Resolution...
[   17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
[   17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[        Starting Network Time Synchronization...
[   17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
[   17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
         Starting Record System Boot/Shutdown in UTMP...
         Starting Virtual Console Setup...
[  OK  ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
[   17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
[   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
S - IMAGE_VARIANT_STRING=SocMakenaWP
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92

  < machine is reset by hypervisor >

Presumably the reset happens when the controller is being shut down
while still being used by the EFI framebuffer.

In the cases where the machine survives boot, the controller is never
suspended.
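
To check a captured boot log for the pattern described above, something
like the following Python sketch could be used. It assumes the log
contains runtime PM messages in the same format as the ones shown above
("msm-dp-display <device>: dp_pm_runtime_resume" and
"... dp_pm_runtime_suspend"); the function names and the regular
expression are illustrative only and not part of any existing tool.

```python
import re
from typing import Dict, List

# Assumed message format, taken from the log excerpt in this message:
#   msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
PM_EVENT = re.compile(
    r"msm-dp-display (\S+): dp_pm_runtime_(resume|suspend)\b"
)


def runtime_pm_events(lines: List[str]) -> Dict[str, List[str]]:
    """Collect the ordered resume/suspend events seen for each device."""
    events: Dict[str, List[str]] = {}
    for line in lines:
        match = PM_EVENT.search(line)
        if match:
            device, event = match.group(1), match.group(2)
            events.setdefault(device, []).append(event)
    return events


def suspended_last(lines: List[str]) -> List[str]:
    """Return devices whose final logged runtime PM event was a suspend."""
    return [
        device
        for device, seen in runtime_pm_events(lines).items()
        if seen[-1] == "suspend"
    ]
```

Running suspended_last() over the lines of a boot that ended in a reset
would list the controllers that were runtime suspended last, which can
then be compared against a boot that survived.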

When investigating this I've also seen intermittent:

	[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed

which also appears to be related to the runtime PM rework:

	https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khsieh@quicinc.com/

I believe this is enough evidence to conclude that this second
regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
incorporate pm_runtime framework into DP driver"):

#regzbot introduced: 5814b8bf086a

Has anyone given some thought to how the framebuffer handover is
supposed to work? It seems we're currently just relying on luck with
timing.

Johan

* Re: drm/msm: Second DisplayPort regression in 6.8-rc1
  2024-02-19 10:41       ` drm/msm: Second DisplayPort regression " Johan Hovold
@ 2024-02-19 13:38         ` Johan Hovold
  2024-02-20 21:19         ` Abhinav Kumar
  1 sibling, 0 replies; 11+ messages in thread
From: Johan Hovold @ 2024-02-19 13:38 UTC (permalink / raw)
  To: Abhinav Kumar, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On Mon, Feb 19, 2024 at 11:41:41AM +0100, Johan Hovold wrote:

> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happens
> when the DP controller is runtime suspended after being probed:
 
> [   17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [   17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> [   17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
> [   17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> 
>   < machine is reset by hypervisor >
> 
> Presumably the reset happens when controller is being shut down while
> still being used by the EFI framebuffer.
> 
> In the cases where the machines survives boot, the controller is never
> suspended.
> 
> When investigating this I've also seen intermittent:
> 
> 	[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed

Note that there are further indications there may be more than one bug
here too.

I definitely see hard resets when dp_pm_runtime_suspend() is shutting
down the eDP PHY, but there are occasional resets also if I instrument
DP controller probe() to resume and then prevent the controller from
suspending until after a timeout (e.g. to be used as a temporary
workaround):

[   15.676495] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[   15.769392] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
[   15.778808] msm-dp-display aea0000.displayport-controller: dp_display_probe - scheduling handover
[   15.789931] probe of aea0000.displayport-controller returned 0 after 91121 usecs
[   15.790460] bus: 'dp-aux': __driver_probe_device: matched device aux-aea0000.displayport-controller with driver panel-simple-dp-aux
Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1

I'll wait for the maintainers and authors of this code to comment, but
it seems the runtime PM work is broken in multiple ways.

Johan

* Re: drm/msm: Second DisplayPort regression in 6.8-rc1
  2024-02-19 10:41       ` drm/msm: Second DisplayPort regression " Johan Hovold
  2024-02-19 13:38         ` Johan Hovold
@ 2024-02-20 21:19         ` Abhinav Kumar
  2024-02-21  8:05           ` Johan Hovold
  1 sibling, 1 reply; 11+ messages in thread
From: Abhinav Kumar @ 2024-02-20 21:19 UTC (permalink / raw)
  To: Johan Hovold, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
  Cc: Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

Hi Johan

On 2/19/2024 2:41 AM, Johan Hovold wrote:
> On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
>> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
>>> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> 
>> Since Dmitry had trouble reproducing this issue I took a closer look at
>> the DRM aux bridge series that Abhinav pointed and was able to track
>> down the bridge regressions and come up with a reproducer. I just posted
>> a series fixing this here:
>>
>> 	https://lore.kernel.org/lkml/20240217150228.5788-1-johan+linaro@kernel.org/
>>
>> As I mentioned in the cover letter, I am still seeing intermittent hard
>> resets around the time that the DRM subsystem is initialising, which
>> suggests that we may be dealing with two separate DRM regressions here
>> however.
>>
>> If the hard resets are triggered by something like unclocked hardware,
>> perhaps that bit could this be related to the runtime PM rework?
> 
> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happens
> when the DP controller is runtime suspended after being probed:
> 
> [   16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
> [   16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
> [   16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [   16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
> [   16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
> [   16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
> [  OK  ] Listening on Load/Save RF Kill Swit[   16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
> [   16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
> [   16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [   16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
> [   16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
> [   16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [   16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
>           Starting Load/Save Screen Backligh…rightness[   16.959146] bus: 'platform': _ of backlight:backlight...
> [   16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
> [   17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
> [   17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
> [   17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
>           Starting Network Name Resolution...
> [   17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
> [   17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [        Starting Network Time Synchronization...
> [   17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> [   17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
>           Starting Record System Boot/Shutdown in UTMP...
>           Starting Virtual Console Setup...
> [  OK  ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
> [   17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> 
>    < machine is reset by hypervisor >
> 
> Presumably the reset happens when controller is being shut down while
> still being used by the EFI framebuffer.
> 

I am not sure we can conclude that. Even if we shut off the
controller while the framebuffer is still being fetched, that should only
cause a blank screen and not a reset, because we don't trigger any
new register read or write while it's fetching, so there is no new
hardware access.

One thing I must accept is that there are two differences between 
sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks 
where we tested it more thoroughly without any such issues:

1) with the chromebooks we have depthcharge and not the QC UEFI.

If we are suspecting a hand-off issue here, will it help if we try to 
disable the display in EFI by using "fastboot oem select-display-panel 
none" (assuming this is a fastboot enabled device) and see if you still 
hit the reset issue?

2) the chromebooks use "internal_hpd", whereas the sc8280xp uses the
pmic_glink method.

I am still checking if there are any code paths in the eDP/DP driver 
left exposed due to this difference with pm_runtime which can cause 
this. I am wondering if some sort of drm tracing will help to narrow 
down the reset point.

> In the cases where the machines survives boot, the controller is never
> suspended.
> 
> When investigating this I've also seen intermittent:
> 
> 	[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
> 

So I think this error is because in dp_parser_parse() --->
dp_parser_ctrl_res(), we also have a devm_phy_get().

This can return -EPROBE_DEFER if the phy driver has not yet probed.

I checked the other things inside dp_parser_parse(); the other calls seem
to be purely DT parsing except this one. I think to avoid the confusion,
we should move devm_phy_get() outside of DT parsing into a separate call,
or at least add an error log in the devm_phy_get() failure path below to
indicate that it deferred:

         io->phy = devm_phy_get(&pdev->dev, "dp");
         if (IS_ERR(io->phy))
                 return PTR_ERR(io->phy);

If my hypothesis is correct on this, then this error log (even though 
misleading) should be harmless for this issue because if we hit 
DRM_ERROR("device tree parsing failed\n"); we will skip the 
devm_pm_runtime_enable().

> which also appears to be related to the runtime PM rework:
> 
> 	https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khsieh@quicinc.com/
> 
> I believe this is enough evidence to conclude that this second
> regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
> incorporate pm_runtime framework into DP driver"):
> 
> #regzbot introduced: 5814b8bf086a
> 
> Has anyone given some thought to how the framebuffer handover is
> supposed to work? It seems we're currently just relying on luck with
> timing.
> 


> Johan

* Re: drm/msm: Second DisplayPort regression in 6.8-rc1
  2024-02-20 21:19         ` Abhinav Kumar
@ 2024-02-21  8:05           ` Johan Hovold
  0 siblings, 0 replies; 11+ messages in thread
From: Johan Hovold @ 2024-02-21  8:05 UTC (permalink / raw)
  To: Abhinav Kumar
  Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Sean Paul,
	Marijn Suijten, David Airlie, Daniel Vetter, Bjorn Andersson,
	quic_jesszhan, quic_sbillaka, dri-devel, freedreno,
	linux-arm-msm, regressions, linux-kernel

On Tue, Feb 20, 2024 at 01:19:54PM -0800, Abhinav Kumar wrote:
> On 2/19/2024 2:41 AM, Johan Hovold wrote:

> > It seems my initial suspicion that at least some of these regressions
> > were related to the runtime PM work was correct. The hard resets happens
> > when the DP controller is runtime suspended after being probed:

> > [   17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> > [        Starting Network Time Synchronization...
> > [   17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> > [   17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
> >           Starting Record System Boot/Shutdown in UTMP...
> >           Starting Virtual Console Setup...
> > [  OK  ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
> > [   17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> > [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> > Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> > S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> > S - IMAGE_VARIANT_STRING=SocMakenaWP
> > S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> > 
> >    < machine is reset by hypervisor >
> > 
> > Presumably the reset happens when controller is being shut down while
> > still being used by the EFI framebuffer.
> 
> I am not sure we can conclude that. Even if we shut off the
> controller while the framebuffer is still being fetched, that should only
> cause a blank screen and not a reset, because we don't trigger any
> new register read or write while it's fetching, so there is no new
> hardware access.

It specifically looks like the reset happens when shutting down the PHY,
that is, the call to dp_display_host_phy_exit(dp) in
dp_pm_runtime_suspend() never returns.

That seems like more than a coincidence to me.
 
> One thing I must accept is that there are two differences between 
> sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks 
> where we tested it more thoroughly without any such issues:
> 
> 1) with the chromebooks we have depthcharge and not the QC UEFI.
> 
> If we are suspecting a hand-off issue here, will it help if we try to 
> disable the display in EFI by using "fastboot oem select-display-panel 
> none" (assuming this is a fastboot enabled device) and see if you still 
> hit the reset issue?

No, we don't have fastboot.

But as I mentioned I still do see resets when I instrument the code to
not shut down the display, which could indicate more than one issue
here.

> 2) the chromebooks use "internal_hpd", whereas the sc8280xp uses the
> pmic_glink method.
> 
> I am still checking if there are any code paths in the eDP/DP driver 
> left exposed due to this difference with pm_runtime which can cause 
> this. I am wondering if some sort of drm tracing will help to narrow 
> down the reset point.
> 
> > In the cases where the machines survives boot, the controller is never
> > suspended.
> > 
> > When investigating this I've also seen intermittent:
> > 
> > 	[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
> 
> So I think this error is because in dp_parser_parse() --->
> dp_parser_ctrl_res(), we also have a devm_phy_get().
> 
> This can return -EPROBE_DEFER if the phy driver has not yet probed.
> 
> I checked the other things inside dp_parser_parse(); the other calls seem
> to be purely DT parsing except this one. I think to avoid the confusion,
> we should move devm_phy_get() outside of DT parsing into a separate call,
> or at least add an error log in the devm_phy_get() failure path below to
> indicate that it deferred:
> 
>          io->phy = devm_phy_get(&pdev->dev, "dp");
>          if (IS_ERR(io->phy))
>                  return PTR_ERR(io->phy);
> 
> If my hypothesis is correct on this, then this error log (even though 
> misleading) should be harmless for this issue because if we hit 
> DRM_ERROR("device tree parsing failed\n"); we will skip the 
> devm_pm_runtime_enable().

Yeah, this seems to be the case as boot appears to recover from this, so
this may indeed be a probe deferral.

Probe deferrals should not be logged as errors however, so the fix is
not to add another error message but rather to suppress the current one
(e.g. using dev_err_probe()).
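
As a rough illustration of that approach, the userspace sketch below mimics
how a dev_err_probe()-style helper treats deferrals: the error code is always
passed back to the caller, but nothing is logged at error level when the
failure is -EPROBE_DEFER. The names sketch_err_probe() and sketch_parse(), the
error_logs counter and the local EPROBE_DEFER definition are assumptions made
for this example only; this is not the kernel implementation or the actual
fix.

```c
#include <stdio.h>

/*
 * EPROBE_DEFER is a kernel-internal errno (517, matching the "-517" probe
 * returns in the log above); it is defined locally because userspace
 * <errno.h> does not provide it.
 */
#define EPROBE_DEFER 517

/* Counts the messages this sketch emits at "error" level. */
static int error_logs;

/*
 * Hypothetical stand-in for dev_err_probe(): log at error level unless the
 * failure is a probe deferral, and return err unchanged so the helper can be
 * used directly in a return statement.
 */
int sketch_err_probe(int err, const char *msg)
{
	if (err != -EPROBE_DEFER) {
		fprintf(stderr, "error %d: %s\n", err, msg);
		error_logs++;
	}
	return err;
}

/*
 * Hypothetical parse step: phy_ret stands for the result of a
 * devm_phy_get()-like lookup (0 on success, negative errno on failure).
 */
int sketch_parse(int phy_ret)
{
	if (phy_ret < 0)
		return sketch_err_probe(phy_ret, "failed to get phy");
	return 0;
}
```

With this shape, sketch_parse(-EPROBE_DEFER) returns -517 silently so the
probe can simply be retried later, while any other negative value is both
logged and returned.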

> > Has anyone given some thought to how the framebuffer handover is
> > supposed to work? It seems we're currently just relying on luck with
> > timing.

Any comments on this? It seems we should not be shutting down (runtime
suspending) the display during boot, as can currently happen.

Johan

* Re: drm/msm: DisplayPort regressions in 6.8-rc1
  2024-02-13 11:42 drm/msm: DisplayPort regressions in 6.8-rc1 Johan Hovold
  2024-02-13 18:00 ` Abhinav Kumar
  2024-02-17 15:22 ` drm/msm: DisplayPort regressions " Johan Hovold
@ 2024-02-23  7:49 ` Linux regression tracking #update (Thorsten Leemhuis)
  2 siblings, 0 replies; 11+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2024-02-23  7:49 UTC (permalink / raw)
  To: regressions

On 13.02.24 12:42, Johan Hovold wrote:
> 
> Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> not always show up on boot.
>

Having two problems discussed in one thread is confusing and not
supported by regzbot (not totally sure, but I guess it likely
shouldn't!). So tell regzbot to handle this like the initial regression;
I'll send a second mail to ensure the other one is tracked, too.

#regzbot introduced: 2bcca96abfbf
#regzbot title: drm as well as soc: qcom: internal eDP display on the
Lenovo ThinkPad X13s does not always show up on boot
#regzbot monitor:
https://lore.kernel.org/all/ZctVmLK4zTwcpW3A@hovoldconsulting.com/
#regzbot from: Johan Hovold <johan+linaro@kernel.org>
#regzbot fix: soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
#regzbot monitor:
https://lore.kernel.org/lkml/20240217150228.5788-1-johan%2Blinaro@kernel.org/
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

end of thread, other threads:[~2024-02-23  7:49 UTC | newest]

Thread overview: 11+ messages
2024-02-13 11:42 drm/msm: DisplayPort regressions in 6.8-rc1 Johan Hovold
2024-02-13 18:00 ` Abhinav Kumar
2024-02-14 13:18   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-14 13:52   ` Johan Hovold
2024-02-17 15:14     ` Johan Hovold
2024-02-19 10:41       ` drm/msm: Second DisplayPort regression " Johan Hovold
2024-02-19 13:38         ` Johan Hovold
2024-02-20 21:19         ` Abhinav Kumar
2024-02-21  8:05           ` Johan Hovold
2024-02-17 15:22 ` drm/msm: DisplayPort regressions " Johan Hovold
2024-02-23  7:49 ` Linux regression tracking #update (Thorsten Leemhuis)
