Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
       [not found] <CAHk-=wjs4AjAKJ26W69xcMB7snFc+0u+rbgA+Tj0S1GvwY2T3Q@mail.gmail.com>
@ 2021-12-20 19:25 ` Daniel Vetter
       [not found] ` <20211220213254.GA7250@ideak-desk.fi.intel.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Daniel Vetter @ 2021-12-20 19:25 UTC (permalink / raw)
  To: Linus Torvalds, amd-gfx list, Wentland, Harry,
	Christian König, airlied
  Cc: Alex Deucher, Imre Deak, Kai-Heng Feng

Adding more amdgpu folks.

Smells like this runtime pm leak is papering over an issue in the
amdgpu driver. Other bug reports are also only about amdgpu.ko it
seems, would an unconditional pm_runtime_get_sync() in
amdgpu_pci_probe() also work? If you do it before the
drm_aperture_remove_conflicting_pci_framebuffers() which throws out
efifb it should be equivalent to the efifb revert.

That would avoid the revert for everyone else, but revert's fine too
imo this close to holidays. I don't think actually fixing this rpm fun
in a late -rc is realistic, these tend to be very gnarly.
-Daniel

On Mon, Dec 20, 2021 at 6:38 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So my lovely eldest daughter came back from NYC for xmas, and two days
> later was informed that she had been in contact with somebody who
> tested positive. And sure enough, so did she.
>
> I'm all vaxxed up and boostered, and feeling fine with just a slightly
> sore throat and no other symptoms, but I did test positive too a few
> days after that. And as a result I'm distancing in my office. It turns
> out that is not so hugely different from my normal life.
>
> One difference *is* that I now have an inflatable mattress and am
> sleeping here by my desktop, and as a result last night I noticed that
> my monitors didn't go to sleep. I had to turn them off manually to get
> the backlight to go away.
>
> So this morning I start to search for the reason, and find this listed
> by the regression bot, and sure enough, reverting commit 55285e21f045
> ("fbdev/efifb: Release PCI device's runtime PM ref during FB destroy")
> makes it all work again.
>
> I've reverted it in my private testing tree, in the hope that somebody
> figures out the actual real cause. But if that doesn't happen, I
> expect to revert it in my main git tree too before rc7.
>
> So this is mainly just a heads-up. I'm not seeing anything interesting
> in the kernel logs, apart from the usual unending stream of
>
>   [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
>   [drm] UVD and UVD ENC initialized successfully.
>   [drm] VCE initialized successfully.
>   [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
>   [drm] UVD and UVD ENC initialized successfully.
>   [drm] VCE initialized successfully.
>
> that happens randomly (sometimes every minute, sometimes hours apart)
> with or without that commit.
>
> This is some random Radeon card with two monitors attached (PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353).
>
> The symptoms are exactly as reported elsewhere:
>
>       https://lore.kernel.org/linux-fbdev/8a27c986-4767-bd29-2073-6c4ffed49bba@jetfuse.net/
>       https://linux-regtracking.leemhuis.info/regzbot/regression/8a27c986-4767-bd29-2073-6c4ffed49bba@jetfuse.net/
>       https://bugzilla.kernel.org/show_bug.cgi?id=215203
>
> ie I can lock my desktop, wait a few seconds, see the "No signal,
> enabling power management" on my monitors, but when that actually
> happens, the desktop lockscreen just comes right back.
>
> I suspect/assume that the screen off event triggers some status event
> on the GPU side, and now some pm logic then just wakes things right up
> again.
>
>                Linus



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
       [not found]   ` <CAHk-=winh9=DS2ZJZbgwTFS3r3oWfrZcM9MedQ4dKzsGW8QaTA@mail.gmail.com>
@ 2021-12-20 22:21     ` Linus Torvalds
  2021-12-21  7:51       ` Christian König
  2021-12-21 17:03       ` Deucher, Alexander
  2021-12-21 17:01     ` Deucher, Alexander
  1 sibling, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2021-12-20 22:21 UTC (permalink / raw)
  To: Imre Deak, Christian König, Pan, Xinhui, Wentland, Harry
  Cc: Alex Deucher, Daniel Vetter, Kai-Heng Feng, amd-gfx list

[ Adding back in more amd people and the amd list, the people Daniel
added seem to have gotten lost again, but I think people at least saw
my original report thanks to Daniel ]

With "amdgpu.runpm=0", things are better, but not perfect. With that I
can lock the screen, and it has to go through *two* cycles of "No
signal, turning off", but on the second cycle it does finally work.

This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
device's runtime PM ref during FB destroy"), probably because that
made runtime PM actually potentially work, but it is then broken on
amdgpu.

Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
1002:67df rev e7, subsystem ID 1da2:e353.

I'd expect pretty much any amdgpu person to see this.

On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Note: on my machine, I get that
>
>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
>
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.

Hmm. The *documentation* says:

    PX runtime pm
        2 = force enable with BAMACO,
        1 = force enable with BACO,
        0 = disable,
        -1 = PX only default

but the code actually makes anything != 0 enable it, except on VEGA20
and ARCTURUS, where it needs to be positive.

My card is apparently "POLARIS10", whatever that means, which means
that any non-zero value of amdgpu_runtime_pm will enable runtime PM as
long as "amdgpu_device_supports_baco()" is true. Which it is.

Whatever. Now I'm just kwetching about the documentation not matching
what I see the code doing, which is never a great sign when things
don't work.

              Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-20 22:21     ` Linus Torvalds
@ 2021-12-21  7:51       ` Christian König
  2021-12-21 17:03       ` Deucher, Alexander
  1 sibling, 0 replies; 10+ messages in thread
From: Christian König @ 2021-12-21  7:51 UTC (permalink / raw)
  To: Linus Torvalds, Imre Deak, Wentland, Harry
  Cc: Alex Deucher, Daniel Vetter, Kai-Heng Feng, Pan, Xinhui, amd-gfx list

Good morning guys,

first of all get better soon Linus.

I'm unfortunately not the best expert for runtime power management 
(Alex) nor display (Harry), but from the lack of their response I guess 
that they are already on vacation. So maybe take everything I explain 
here with a grain of salt.

Then for the background we have two separate power management features 
here which doesn't seem to work as they should.

The first buggy one is runtime power management, which is what commit 
55285e21f045 surfaces. My educated guess is that the now corrected 
reference counting turns of the GPU before userspace has a chance to 
send a signal to the monitor to turn of it's backlight. Double checking 
the code I can see the correct calls to pm_runtime_get_*() and 
pm_runtime_put_*() in amdgpu_dm_atomic_commit_tail(), but to be honest 
that function seems to be quite a mess.

A trace of what exactly happens during PM autosuspend might help here. 
Daniel do you know any tracepoint for that?

Then we have DPMS, which is basically the way of telling the monitor to 
shut of it's backlight. When this doesn't work as expected (e.g. you 
need *two* cycles) then it can as well be that userspace is not sending 
the right command.

When you use X you could double check with "xset dpms force off" and 
"xset dpms force suspend". At least with my monitor it turns of the 
backlight in both cases, but maybe your hardware behaves differently.

Regards,
Christian.

Am 20.12.21 um 23:21 schrieb Linus Torvalds:
> [ Adding back in more amd people and the amd list, the people Daniel
> added seem to have gotten lost again, but I think people at least saw
> my original report thanks to Daniel ]
>
> With "amdgpu.runpm=0", things are better, but not perfect. With that I
> can lock the screen, and it has to go through *two* cycles of "No
> signal, turning off", but on the second cycle it does finally work.
>
> This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
> device's runtime PM ref during FB destroy"), probably because that
> made runtime PM actually potentially work, but it is then broken on
> amdgpu.
>
> Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353.
>
> I'd expect pretty much any amdgpu person to see this.
>
> On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Note: on my machine, I get that
>>
>>     amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
>>
>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
>> and it's only that BACO case that is broken.
> Hmm. The *documentation* says:
>
>      PX runtime pm
>          2 = force enable with BAMACO,
>          1 = force enable with BACO,
>          0 = disable,
>          -1 = PX only default
>
> but the code actually makes anything != 0 enable it, except on VEGA20
> and ARCTURUS, where it needs to be positive.
>
> My card is apparently "POLARIS10", whatever that means, which means
> that any non-zero value of amdgpu_runtime_pm will enable runtime PM as
> long as "amdgpu_device_supports_baco()" is true. Which it is.
>
> Whatever. Now I'm just kwetching about the documentation not matching
> what I see the code doing, which is never a great sign when things
> don't work.
>
>                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
       [not found]   ` <CAHk-=winh9=DS2ZJZbgwTFS3r3oWfrZcM9MedQ4dKzsGW8QaTA@mail.gmail.com>
  2021-12-20 22:21     ` Linus Torvalds
@ 2021-12-21 17:01     ` Deucher, Alexander
  2021-12-21 18:47       ` Deucher, Alexander
  1 sibling, 1 reply; 10+ messages in thread
From: Deucher, Alexander @ 2021-12-21 17:01 UTC (permalink / raw)
  To: Linus Torvalds, Imre Deak, amd-gfx; +Cc: Daniel Vetter, Kai-Heng Feng

[Public]

> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Monday, December 20, 2021 5:05 PM
> To: Imre Deak <imre.deak@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Kai-Heng Feng
> <kai.heng.feng@canonical.com>
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com> wrote:
> >
> > amdgpu.runpm=0
> 
> Hmmm.
> 
> This does seem to "work", but not very well.
> 
> With this, what seems to happen is odd: I lock the screen, wait, it goes "No
> signal, shutting down", but then doesn't actually shut down but stays black
> (with the backlight on). After _another_ five seconds or so, the monitor goes
> "No signal, shutting down" _again_, and at that point it actually does it.
> 
> So it solves my immediate problem - in that yes, the backlight finally does
> turn off in the end - but it does seem to be still broken.
> 
> I'm very surprised if no AMD drm developers can see this exact same thing.
> This is a very simple setup. The only possibly slightly less common thing is that
> I have two monitors, but while that is not necessarily the _most_ common
> setup in an absolute sense, I'd expect it to be very common among DRM
> developers..
> 
> I guess I can just change the revert to just a
> 
>     -int amdgpu_runtime_pm = -1;
>     +int amdgpu_runtime_pm = 0;
> 
> instead. The auto-detect is apparently broken. Maybe it should only kick in
> for LVDS screens on actual laptops?
> 
> Note: on my machine, I get that
> 
>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> 
> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> and it's only that BACO case that is broken.
> 
> I have no idea what any of those three things are - I'm just looking at the
> uses of that amdgpu_runtime_pm variable.
> 
> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> default, tell me something else to try.

For a little background, runtime PM support was added about 10 year ago originally to support laptops with multiple GPUs (integrated and discrete).  It's not specific to the display hardware.  When the GPU is idle, it can be powered down completely.  In the case of these laptops, it's D3 cold (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off) which powers off the dGPU completely (i.e., it disappears from the bus).  A few years ago we extended this to support desktop dGPUs as well which support their own version of runtime D3 (called BACO in AMD parlance - Bus Active, Chip Off).  The driver can put the chip into a low power state where everything except the bus interface is powered down (to avoid the device disappearing from the bus).  So this has worked for almost 2 years now on BACO capable parts and for a decade or more on BOCO systems.  Unfortunately, changing the default runpm parameter setting would cause a flood of bug reports about runtime power management breaking and suddenly systems are using more power.

Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).  Runtime pm was working on amdgpu prior to that commit.  Is it possible there is still some race between when amdgpu takes over from efifb?  Does it work properly when all pm_runtime calls in efifb are removed or if efifb is not enabled?  Runtime pm for Polaris boards has been enabled by default since 4fdda2e66de0b which predates both of those patches.

Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-20 22:21     ` Linus Torvalds
  2021-12-21  7:51       ` Christian König
@ 2021-12-21 17:03       ` Deucher, Alexander
  1 sibling, 0 replies; 10+ messages in thread
From: Deucher, Alexander @ 2021-12-21 17:03 UTC (permalink / raw)
  To: Linus Torvalds, Imre Deak, Koenig, Christian, Pan, Xinhui,
	Wentland, Harry
  Cc: Daniel Vetter, Kai-Heng Feng, amd-gfx list

[Public]

> -----Original Message-----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Monday, December 20, 2021 5:21 PM
> To: Imre Deak <imre.deak@intel.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>;
> Wentland, Harry <Harry.Wentland@amd.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Kai-Heng Feng
> <kai.heng.feng@canonical.com>; amd-gfx list <amd-
> gfx@lists.freedesktop.org>
> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [ Adding back in more amd people and the amd list, the people Daniel added
> seem to have gotten lost again, but I think people at least saw my original
> report thanks to Daniel ]
> 
> With "amdgpu.runpm=0", things are better, but not perfect. With that I can
> lock the screen, and it has to go through *two* cycles of "No signal, turning
> off", but on the second cycle it does finally work.
> 
> This was exposed by commit 55285e21f045 ("fbdev/efifb: Release PCI
> device's runtime PM ref during FB destroy"), probably because that made
> runtime PM actually potentially work, but it is then broken on amdgpu.
> 
> Absolutely nothing odd in my setup. Two monitors, one GPU. PCI ID
> 1002:67df rev e7, subsystem ID 1da2:e353.
> 
> I'd expect pretty much any amdgpu person to see this.
> 
> On Mon, Dec 20, 2021 at 2:04 PM Linus Torvalds <torvalds@linux-
> foundation.org> wrote:
> >
> > Note: on my machine, I get that
> >
> >    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> 
> Hmm. The *documentation* says:
> 
>     PX runtime pm
>         2 = force enable with BAMACO,
>         1 = force enable with BACO,
>         0 = disable,
>         -1 = PX only default
> 
> but the code actually makes anything != 0 enable it, except on VEGA20 and
> ARCTURUS, where it needs to be positive.
> 
> My card is apparently "POLARIS10", whatever that means, which means that
> any non-zero value of amdgpu_runtime_pm will enable runtime PM as long
> as "amdgpu_device_supports_baco()" is true. Which it is.
> 
> Whatever. Now I'm just kwetching about the documentation not matching
> what I see the code doing, which is never a great sign when things don't
> work.

Apologies on the documentation.  -1 is the default and is enabled for all dGPUs which support runtime D3.  It was never fixed up when we extended support for runtime pm beyond PX/HG laptops.  Fixed up the documentation here:
https://patchwork.freedesktop.org/patch/467681/

Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-21 17:01     ` Deucher, Alexander
@ 2021-12-21 18:47       ` Deucher, Alexander
  2021-12-21 21:18         ` Alex Deucher
  2021-12-23  4:07         ` Alex Deucher
  0 siblings, 2 replies; 10+ messages in thread
From: Deucher, Alexander @ 2021-12-21 18:47 UTC (permalink / raw)
  To: Linus Torvalds, Imre Deak, amd-gfx, Lyude Paul
  Cc: Daniel Vetter, Kai-Heng Feng

[Public]

> -----Original Message-----
> From: Deucher, Alexander
> Sent: Tuesday, December 21, 2021 12:01 PM
> To: Linus Torvalds <torvalds@linux-foundation.org>; Imre Deak
> <imre.deak@intel.com>; amd-gfx@lists.freedesktop.org
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Kai-Heng Feng
> <kai.heng.feng@canonical.com>
> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> PCI device ..."
> 
> [Public]
> 
> > -----Original Message-----
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> > Sent: Monday, December 20, 2021 5:05 PM
> > To: Imre Deak <imre.deak@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> > <Alexander.Deucher@amd.com>; Kai-Heng Feng
> > <kai.heng.feng@canonical.com>
> > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > Release PCI device ..."
> >
> > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com>
> wrote:
> > >
> > > amdgpu.runpm=0
> >
> > Hmmm.
> >
> > This does seem to "work", but not very well.
> >
> > With this, what seems to happen is odd: I lock the screen, wait, it
> > goes "No signal, shutting down", but then doesn't actually shut down
> > but stays black (with the backlight on). After _another_ five seconds
> > or so, the monitor goes "No signal, shutting down" _again_, and at that
> point it actually does it.
> >
> > So it solves my immediate problem - in that yes, the backlight finally
> > does turn off in the end - but it does seem to be still broken.
> >
> > I'm very surprised if no AMD drm developers can see this exact same thing.
> > This is a very simple setup. The only possibly slightly less common
> > thing is that I have two monitors, but while that is not necessarily
> > the _most_ common setup in an absolute sense, I'd expect it to be very
> > common among DRM developers..
> >
> > I guess I can just change the revert to just a
> >
> >     -int amdgpu_runtime_pm = -1;
> >     +int amdgpu_runtime_pm = 0;
> >
> > instead. The auto-detect is apparently broken. Maybe it should only
> > kick in for LVDS screens on actual laptops?
> >
> > Note: on my machine, I get that
> >
> >    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >
> > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > and it's only that BACO case that is broken.
> >
> > I have no idea what any of those three things are - I'm just looking
> > at the uses of that amdgpu_runtime_pm variable.
> >
> > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > default, tell me something else to try.
> 
> For a little background, runtime PM support was added about 10 year ago
> originally to support laptops with multiple GPUs (integrated and discrete).
> It's not specific to the display hardware.  When the GPU is idle, it can be
> powered down completely.  In the case of these laptops, it's D3 cold
> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> which powers off the dGPU completely (i.e., it disappears from the bus).  A
> few years ago we extended this to support desktop dGPUs as well which
> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> Active, Chip Off).  The driver can put the chip into a low power state where
> everything except the bus interface is powered down (to avoid the device
> disappearing from the bus).  So this has worked for almost 2 years now on
> BACO capable parts and for a decade or more on BOCO systems.
> Unfortunately, changing the default runpm parameter setting would cause a
> flood of bug reports about runtime power management breaking and
> suddenly systems are using more power.
> 
> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> there is still some race between when amdgpu takes over from efifb?  Does
> it work properly when all pm_runtime calls in efifb are removed or if efifb is
> not enabled?  Runtime pm for Polaris boards has been enabled by default
> since 4fdda2e66de0b which predates both of those patches.

Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userpace component.

What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after  seconds.  When you move the mouse or hit the keyboard, that should trigger the GPU should runtime resume and then the displays will be re-enabled.

In the behavior you are seeing, when the displays come back on after they blank are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend when the GPU was powered down (and hence could not detect a hotplug event).  Perhaps that event is triggering userspace to reprobe and re-enable the displays shortly after resume from runtime suspend due to some other event that caused the device to runtime resume.  Does something like this help by any chance?
https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw

Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-21 18:47       ` Deucher, Alexander
@ 2021-12-21 21:18         ` Alex Deucher
  2021-12-21 22:09           ` Harry Wentland
  2021-12-23  4:07         ` Alex Deucher
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Deucher @ 2021-12-21 21:18 UTC (permalink / raw)
  To: Deucher, Alexander, Wentland, Harry
  Cc: Daniel Vetter, Imre Deak, amd-gfx, Kai-Heng Feng, Linus Torvalds

On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
<Alexander.Deucher@amd.com> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds <torvalds@linux-foundation.org>; Imre Deak
> > <imre.deak@intel.com>; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Kai-Heng Feng
> > <kai.heng.feng@canonical.com>
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Linus Torvalds <torvalds@linux-foundation.org>
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak <imre.deak@intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> > > <Alexander.Deucher@amd.com>; Kai-Heng Feng
> > > <kai.heng.feng@canonical.com>
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com>
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > >     -int amdgpu_runtime_pm = -1;
> > >     +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > >    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 year ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware.  When the GPU is idle, it can be
> > powered down completely.  In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus).  A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off).  The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus).  So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > suddenly systems are using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit.  Is it possible
> > there is still some race between when amdgpu takes over from efifb?  Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb is
> > not enabled?  Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userpace component.
>
> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after  seconds.  When you move the mouse or hit the keyboard, that should trigger the GPU should runtime resume and then the displays will be re-enabled.
>
> In the behavior you are seeing, when the displays come back on after they blank are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend when the GPU was powered down (and hence could not detect a hotplug event).  Perhaps that event is triggering userspace to reprobe and re-enable the displays shortly after resume from runtime suspend due to some other event that caused the device to runtime resume.  Does something like this help by any chance?
> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw

I'm not seeing this on my system, and another user tried the patch and
it fixed his system, so it indeed seems to be something along the
lines of what I described above.  Something in userspace seems to be
accessing the GPU causing it to runtime resume.  On resume the driver
then sends a hotplug event to userspace (in case anything display
related changed while the GPU was suspended).  The desktop manager
then sees the hotplug event and reprobes and lights up the displays
again.  @Wentland, Harry, I guess the display code may already handle
this (seeing if anything has changed on resume since suspend), so we
can probably drop the call from the generic amdgpu resume code?  Or if
not, it should be pretty straightforward to fix that in
dm_suspend()/dm_resume().

Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-21 21:18         ` Alex Deucher
@ 2021-12-21 22:09           ` Harry Wentland
  2021-12-21 22:18             ` Alex Deucher
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Wentland @ 2021-12-21 22:09 UTC (permalink / raw)
  To: Alex Deucher, Deucher, Alexander
  Cc: Daniel Vetter, Imre Deak, amd-gfx, Kai-Heng Feng, Linus Torvalds



On 2021-12-21 16:18, Alex Deucher wrote:
> On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> <Alexander.Deucher@amd.com> wrote:
>>
>> [Public]
>>
>>> -----Original Message-----
>>> From: Deucher, Alexander
>>> Sent: Tuesday, December 21, 2021 12:01 PM
>>> To: Linus Torvalds <torvalds@linux-foundation.org>; Imre Deak
>>> <imre.deak@intel.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Kai-Heng Feng
>>> <kai.heng.feng@canonical.com>
>>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
>>> PCI device ..."
>>>
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Linus Torvalds <torvalds@linux-foundation.org>
>>>> Sent: Monday, December 20, 2021 5:05 PM
>>>> To: Imre Deak <imre.deak@intel.com>
>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
>>>> <Alexander.Deucher@amd.com>; Kai-Heng Feng
>>>> <kai.heng.feng@canonical.com>
>>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
>>>> Release PCI device ..."
>>>>
>>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com>
>>> wrote:
>>>>>
>>>>> amdgpu.runpm=0
>>>>
>>>> Hmmm.
>>>>
>>>> This does seem to "work", but not very well.
>>>>
>>>> With this, what seems to happen is odd: I lock the screen, wait, it
>>>> goes "No signal, shutting down", but then doesn't actually shut down
>>>> but stays black (with the backlight on). After _another_ five seconds
>>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
>>> point it actually does it.
>>>>
>>>> So it solves my immediate problem - in that yes, the backlight finally
>>>> does turn off in the end - but it does seem to be still broken.
>>>>
>>>> I'm very surprised if no AMD drm developers can see this exact same thing.
>>>> This is a very simple setup. The only possibly slightly less common
>>>> thing is that I have two monitors, but while that is not necessarily
>>>> the _most_ common setup in an absolute sense, I'd expect it to be very
>>>> common among DRM developers..
>>>>
>>>> I guess I can just change the revert to just a
>>>>
>>>>     -int amdgpu_runtime_pm = -1;
>>>>     +int amdgpu_runtime_pm = 0;
>>>>
>>>> instead. The auto-detect is apparently broken. Maybe it should only
>>>> kick in for LVDS screens on actual laptops?
>>>>
>>>> Note: on my machine, I get that
>>>>
>>>>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
>>>>
>>>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
>>>> and it's only that BACO case that is broken.
>>>>
>>>> I have no idea what any of those three things are - I'm just looking
>>>> at the uses of that amdgpu_runtime_pm variable.
>>>>
>>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
>>>> default, tell me something else to try.
>>>
>>> For a little background, runtime PM support was added about 10 year ago
>>> originally to support laptops with multiple GPUs (integrated and discrete).
>>> It's not specific to the display hardware.  When the GPU is idle, it can be
>>> powered down completely.  In the case of these laptops, it's D3 cold
>>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
>>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
>>> few years ago we extended this to support desktop dGPUs as well which
>>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
>>> Active, Chip Off).  The driver can put the chip into a low power state where
>>> everything except the bus interface is powered down (to avoid the device
>>> disappearing from the bus).  So this has worked for almost 2 years now on
>>> BACO capable parts and for a decade or more on BOCO systems.
>>> Unfortunately, changing the default runpm parameter setting would cause a
>>> flood of bug reports about runtime power management breaking and
>>> suddenly systems are using more power.
>>>
>>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
>>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
>>> there is still some race between when amdgpu takes over from efifb?  Does
>>> it work properly when all pm_runtime calls in efifb are removed or if efifb is
>>> not enabled?  Runtime pm for Polaris boards has been enabled by default
>>> since 4fdda2e66de0b which predates both of those patches.
>>
>> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userpace component.
>>
>> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after  seconds.  When you move the mouse or hit the keyboard, that should trigger the GPU should runtime resume and then the displays will be re-enabled.
>>
>> In the behavior you are seeing, when the displays come back on after they blank are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend when the GPU was powered down (and hence could not detect a hotplug event).  Perhaps that event is triggering userspace to reprobe and re-enable the displays shortly after resume from runtime suspend due to some other event that caused the device to runtime resume.  Does something like this help by any chance?
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fattachment.cgi%3Fid%3D300103%26action%3Ddiff%26collapsed%3D%26headers%3D1%26format%3Draw&amp;data=04%7C01%7CHarry.Wentland%40amd.com%7Cd1d2f9528d8e42199af508d9c4c7793d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637757183279389936%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=KdIAXxYP0oQpNCFCe1slYwqgDhRdse2CrkGuCc2KrAE%3D&amp;reserved=0
> 
> I'm not seeing this on my system, and another user tried the patch and
> it fixed his system, so it indeed seems to be something along the
> lines of what I described above.  Something in userspace seems to be
> accessing the GPU causing it to runtime resume.  On resume the driver
> then sends a hotplug event to userspace (in case anything display
> related changed while the GPU was suspended).  The desktop manager
> then sees the hotplug event and reprobes and lights up the displays
> again.  @Wentland, Harry, I guess the display code may already handle
> this (seeing if anything has changed on resume since suspend), so we
> can probably drop the call from the generic amdgpu resume code?  Or if
> not, it should be pretty straightforward to fix that in
> dm_suspend()/dm_resume().
> 

We handle re-detection in dm_resume but only seem to call the
drm_kms_helper_hotplug_event for MST. We might want to call it
in dm_resume but that would just move it from amdgpu_device_resume
and will probably cause the same issue.

What I think we'd really want here is to make sure
drm_kms_helper_hotplug_event is only called when any display
config actually changes. Unfortunately, we're not being very
smart in our detection and just recreate everything (even
though dc_link_detect_helper seems to have code that wants to
be smart enough to detect if the sink is changed or not).
This means this change is non-trivial. At least I can't see
an easy fix that doesn't run the risk of breaking a bunch of
stuff.

Or maybe someone can try to see if (non-MST) detection during
RPM still works with your patch. If it does (for some reason)
then we should be good.

I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
I wonder if it's a Polaris issue.

I also don't see my Navi 1x card going into RPM after
'xset dpms force off' with the Manjaro 5.15 kernel. Might need
to try the latest mainline or amd-stg.

Harry


> Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-21 22:09           ` Harry Wentland
@ 2021-12-21 22:18             ` Alex Deucher
  0 siblings, 0 replies; 10+ messages in thread
From: Alex Deucher @ 2021-12-21 22:18 UTC (permalink / raw)
  To: Harry Wentland
  Cc: Daniel Vetter, Imre Deak, amd-gfx, Kai-Heng Feng, Deucher,
	Alexander, Linus Torvalds

On Tue, Dec 21, 2021 at 5:09 PM Harry Wentland <harry.wentland@amd.com> wrote:
>
>
>
> On 2021-12-21 16:18, Alex Deucher wrote:
> > On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> > <Alexander.Deucher@amd.com> wrote:
> >>
> >> [Public]
> >>
> >>> -----Original Message-----
> >>> From: Deucher, Alexander
> >>> Sent: Tuesday, December 21, 2021 12:01 PM
> >>> To: Linus Torvalds <torvalds@linux-foundation.org>; Imre Deak
> >>> <imre.deak@intel.com>; amd-gfx@lists.freedesktop.org
> >>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Kai-Heng Feng
> >>> <kai.heng.feng@canonical.com>
> >>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> >>> PCI device ..."
> >>>
> >>> [Public]
> >>>
> >>>> -----Original Message-----
> >>>> From: Linus Torvalds <torvalds@linux-foundation.org>
> >>>> Sent: Monday, December 20, 2021 5:05 PM
> >>>> To: Imre Deak <imre.deak@intel.com>
> >>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> >>>> <Alexander.Deucher@amd.com>; Kai-Heng Feng
> >>>> <kai.heng.feng@canonical.com>
> >>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> >>>> Release PCI device ..."
> >>>>
> >>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com>
> >>> wrote:
> >>>>>
> >>>>> amdgpu.runpm=0
> >>>>
> >>>> Hmmm.
> >>>>
> >>>> This does seem to "work", but not very well.
> >>>>
> >>>> With this, what seems to happen is odd: I lock the screen, wait, it
> >>>> goes "No signal, shutting down", but then doesn't actually shut down
> >>>> but stays black (with the backlight on). After _another_ five seconds
> >>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
> >>> point it actually does it.
> >>>>
> >>>> So it solves my immediate problem - in that yes, the backlight finally
> >>>> does turn off in the end - but it does seem to be still broken.
> >>>>
> >>>> I'm very surprised if no AMD drm developers can see this exact same thing.
> >>>> This is a very simple setup. The only possibly slightly less common
> >>>> thing is that I have two monitors, but while that is not necessarily
> >>>> the _most_ common setup in an absolute sense, I'd expect it to be very
> >>>> common among DRM developers..
> >>>>
> >>>> I guess I can just change the revert to just a
> >>>>
> >>>>     -int amdgpu_runtime_pm = -1;
> >>>>     +int amdgpu_runtime_pm = 0;
> >>>>
> >>>> instead. The auto-detect is apparently broken. Maybe it should only
> >>>> kick in for LVDS screens on actual laptops?
> >>>>
> >>>> Note: on my machine, I get that
> >>>>
> >>>>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> >>>>
> >>>> so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> >>>> and it's only that BACO case that is broken.
> >>>>
> >>>> I have no idea what any of those three things are - I'm just looking
> >>>> at the uses of that amdgpu_runtime_pm variable.
> >>>>
> >>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> >>>> default, tell me something else to try.
> >>>
> >>> For a little background, runtime PM support was added about 10 year ago
> >>> originally to support laptops with multiple GPUs (integrated and discrete).
> >>> It's not specific to the display hardware.  When the GPU is idle, it can be
> >>> powered down completely.  In the case of these laptops, it's D3 cold
> >>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> >>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
> >>> few years ago we extended this to support desktop dGPUs as well which
> >>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
> >>> Active, Chip Off).  The driver can put the chip into a low power state where
> >>> everything except the bus interface is powered down (to avoid the device
> >>> disappearing from the bus).  So this has worked for almost 2 years now on
> >>> BACO capable parts and for a decade or more on BOCO systems.
> >>> Unfortunately, changing the default runpm parameter setting would cause a
> >>> flood of bug reports about runtime power management breaking and
> >>> suddenly systems are using more power.
> >>>
> >>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> >>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
> >>> there is still some race between when amdgpu takes over from efifb?  Does
> >>> it work properly when all pm_runtime calls in efifb are removed or if efifb is
> >>> not enabled?  Runtime pm for Polaris boards has been enabled by default
> >>> since 4fdda2e66de0b which predates both of those patches.
> >>
> >> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userpace component.
> >>
> >> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after  seconds.  When you move the mouse or hit the keyboard, that should trigger the GPU should runtime resume and then the displays will be re-enabled.
> >>
> >> In the behavior you are seeing, when the displays come back on after they blank are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend when the GPU was powered down (and hence could not detect a hotplug event).  Perhaps that event is triggering userspace to reprobe and re-enable the displays shortly after resume from runtime suspend due to some other event that caused the device to runtime resume.  Does something like this help by any chance?
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fattachment.cgi%3Fid%3D300103%26action%3Ddiff%26collapsed%3D%26headers%3D1%26format%3Draw&amp;data=04%7C01%7CHarry.Wentland%40amd.com%7Cd1d2f9528d8e42199af508d9c4c7793d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637757183279389936%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=KdIAXxYP0oQpNCFCe1slYwqgDhRdse2CrkGuCc2KrAE%3D&amp;reserved=0
> >
> > I'm not seeing this on my system, and another user tried the patch and
> > it fixed his system, so it indeed seems to be something along the
> > lines of what I described above.  Something in userspace seems to be
> > accessing the GPU causing it to runtime resume.  On resume the driver
> > then sends a hotplug event to userspace (in case anything display
> > related changed while the GPU was suspended).  The desktop manager
> > then sees the hotplug event and reprobes and lights up the displays
> > again.  @Wentland, Harry, I guess the display code may already handle
> > this (seeing if anything has changed on resume since suspend), so we
> > can probably drop the call from the generic amdgpu resume code?  Or if
> > not, it should be pretty straightforward to fix that in
> > dm_suspend()/dm_resume().
> >
>
> We handle re-detection in dm_resume but only seem to call the
> drm_kms_helper_hotplug_event for MST. We might want to call it
> in dm_resume but that would just move it from amdgpu_device_resume
> and will probably cause the same issue.
>
> What I think we'd really want here is to make sure
> drm_kms_helper_hotplug_event is only called when any display
> config actually changes. Unfortunately, we're not being very
> smart in our detection and just recreate everything (even
> though dc_link_detect_helper seems to have code that wants to
> be smart enough to detect if the sink is changed or not).
> This means this change is non-trivial. At least I can't see
> an easy fix that doesn't run the risk of breaking a bunch of
> stuff.

I think something like this patch may do the trick:
https://bugzilla.kernel.org/attachment.cgi?id=300109&action=diff&collapsed=&headers=1&format=raw
We can use the helper to check if anything actually changed.

Alex

>
> Or maybe someone can try to see if (non-MST) detection during
> RPM still works with your patch. If it does (for some reason)
> then we should be good.
>
> I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
> I wonder if it's a Polaris issue.
>
> I also don't see my Navi 1x card going into RPM after
> 'xset dpms force off' with the Manjaro 5.15 kernel. Might need
> to try the latest mainline or amd-stg.
>
> Harry
>
>
> > Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..."
  2021-12-21 18:47       ` Deucher, Alexander
  2021-12-21 21:18         ` Alex Deucher
@ 2021-12-23  4:07         ` Alex Deucher
  1 sibling, 0 replies; 10+ messages in thread
From: Alex Deucher @ 2021-12-23  4:07 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: Daniel Vetter, Imre Deak, amd-gfx, Kai-Heng Feng, Linus Torvalds

On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
<Alexander.Deucher@amd.com> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: Deucher, Alexander
> > Sent: Tuesday, December 21, 2021 12:01 PM
> > To: Linus Torvalds <torvalds@linux-foundation.org>; Imre Deak
> > <imre.deak@intel.com>; amd-gfx@lists.freedesktop.org
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Kai-Heng Feng
> > <kai.heng.feng@canonical.com>
> > Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
> > PCI device ..."
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: Linus Torvalds <torvalds@linux-foundation.org>
> > > Sent: Monday, December 20, 2021 5:05 PM
> > > To: Imre Deak <imre.deak@intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>; Deucher, Alexander
> > > <Alexander.Deucher@amd.com>; Kai-Heng Feng
> > > <kai.heng.feng@canonical.com>
> > > Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
> > > Release PCI device ..."
> > >
> > > On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <imre.deak@intel.com>
> > wrote:
> > > >
> > > > amdgpu.runpm=0
> > >
> > > Hmmm.
> > >
> > > This does seem to "work", but not very well.
> > >
> > > With this, what seems to happen is odd: I lock the screen, wait, it
> > > goes "No signal, shutting down", but then doesn't actually shut down
> > > but stays black (with the backlight on). After _another_ five seconds
> > > or so, the monitor goes "No signal, shutting down" _again_, and at that
> > point it actually does it.
> > >
> > > So it solves my immediate problem - in that yes, the backlight finally
> > > does turn off in the end - but it does seem to be still broken.
> > >
> > > I'm very surprised if no AMD drm developers can see this exact same thing.
> > > This is a very simple setup. The only possibly slightly less common
> > > thing is that I have two monitors, but while that is not necessarily
> > > the _most_ common setup in an absolute sense, I'd expect it to be very
> > > common among DRM developers..
> > >
> > > I guess I can just change the revert to just a
> > >
> > >     -int amdgpu_runtime_pm = -1;
> > >     +int amdgpu_runtime_pm = 0;
> > >
> > > instead. The auto-detect is apparently broken. Maybe it should only
> > > kick in for LVDS screens on actual laptops?
> > >
> > > Note: on my machine, I get that
> > >
> > >    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
> > >
> > > so maybe the other possible runtime pm models (ARPX and BOCO) are ok,
> > > and it's only that BACO case that is broken.
> > >
> > > I have no idea what any of those three things are - I'm just looking
> > > at the uses of that amdgpu_runtime_pm variable.
> > >
> > > amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
> > > default, tell me something else to try.
> >
> > For a little background, runtime PM support was added about 10 year ago
> > originally to support laptops with multiple GPUs (integrated and discrete).
> > It's not specific to the display hardware.  When the GPU is idle, it can be
> > powered down completely.  In the case of these laptops, it's D3 cold
> > (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
> > which powers off the dGPU completely (i.e., it disappears from the bus).  A
> > few years ago we extended this to support desktop dGPUs as well which
> > support their own version of runtime D3 (called BACO in AMD parlance - Bus
> > Active, Chip Off).  The driver can put the chip into a low power state where
> > everything except the bus interface is powered down (to avoid the device
> > disappearing from the bus).  So this has worked for almost 2 years now on
> > BACO capable parts and for a decade or more on BOCO systems.
> > Unfortunately, changing the default runpm parameter setting would cause a
> > flood of bug reports about runtime power management breaking and
> > suddenly systems are using more power.
> >
> > Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
> > Runtime pm was working on amdgpu prior to that commit.  Is it possible
> > there is still some race between when amdgpu takes over from efifb?  Does
> > it work properly when all pm_runtime calls in efifb are removed or if efifb is
> > not enabled?  Runtime pm for Polaris boards has been enabled by default
> > since 4fdda2e66de0b which predates both of those patches.
>
> Thinking about this more, I wonder if there was some change in some userspace component which was hidden by the changes in 55285e21f045 and a6c0fd3d5a8b.  E.g., some desktop component started polling for display changes or GPU temperature or something like that and when a6c0fd3d5a8b was in place the GPU never entered runtime suspend.  Then when 55285e21f045 was applied, it unmasked the new behavior in the userpace component.
>
> What should happen is that when all of the displays blank, assuming the GPU is otherwise idle, the GPU will runtime suspend after  seconds.  When you move the mouse or hit the keyboard, that should trigger the GPU should runtime resume and then the displays will be re-enabled.
>
> In the behavior you are seeing, when the displays come back on after they blank are you seeing the device resume from runtime suspend?  On resume from suspend (runtime or system) we issue a hotplug notification to userspace in case any displays changed during suspend when the GPU was powered down (and hence could not detect a hotplug event).  Perhaps that event is triggering userspace to reprobe and re-enable the displays shortly after resume from runtime suspend due to some other event that caused the device to runtime resume.  Does something like this help by any chance?
> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw

I've been working with the reporters of the following bug reports.
Probably best to follow the progress there:
https://bugzilla.kernel.org/show_bug.cgi?id=215203
https://gitlab.freedesktop.org/drm/amd/-/issues/1840

Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-12-23  4:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=wjs4AjAKJ26W69xcMB7snFc+0u+rbgA+Tj0S1GvwY2T3Q@mail.gmail.com>
2021-12-20 19:25 ` Expecting to revert commit 55285e21f045 "fbdev/efifb: Release PCI device ..." Daniel Vetter
     [not found] ` <20211220213254.GA7250@ideak-desk.fi.intel.com>
     [not found]   ` <CAHk-=winh9=DS2ZJZbgwTFS3r3oWfrZcM9MedQ4dKzsGW8QaTA@mail.gmail.com>
2021-12-20 22:21     ` Linus Torvalds
2021-12-21  7:51       ` Christian König
2021-12-21 17:03       ` Deucher, Alexander
2021-12-21 17:01     ` Deucher, Alexander
2021-12-21 18:47       ` Deucher, Alexander
2021-12-21 21:18         ` Alex Deucher
2021-12-21 22:09           ` Harry Wentland
2021-12-21 22:18             ` Alex Deucher
2021-12-23  4:07         ` Alex Deucher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.