On Thu, Nov 28, 2019 at 11:03:57PM +0100, Rafael J. Wysocki wrote:
> On Thursday, November 28, 2019 5:50:26 PM CET Thierry Reding wrote:
> > 
> > --0F1p//8PRICkK4MW
> > Content-Type: text/plain; charset=us-ascii
> > Content-Disposition: inline
> > Content-Transfer-Encoding: quoted-printable
> > 
> > On Thu, Nov 28, 2019 at 05:14:51PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Nov 28, 2019 at 5:03 PM Thierry Reding <thierry.reding@gmail.com>=
> >  wrote:
> > > >
> > > > From: Thierry Reding <treding@nvidia.com>
> > > >
> > > > Currently the driver PM core will automatically acquire a runtime PM
> > > > reference for devices before system sleep is entered. This is needed
> > > > to avoid potential issues related to devices' parents getting put to
> > > > runtime suspend at the wrong time and causing problems with their
> > > > children.
> > >=20
> > > Not only for that.
> > >=20
> > > > In some cases drivers are carefully written to avoid such issues and
> > > > the default behaviour can be changed to allow runtime PM to operate
> > > > regularly during system sleep.
> > >=20
> > > But this change breaks quite a few assumptions in the core too, so no,
> > > it can't be made.
> > 
> > Anything in particular that I can look at? I'm not seeing any issues
> > when I test this, which could of course mean that I'm just getting
> > lucky.
> 
> There are races and such that you may never hit during casual testing.
> 
> > One thing that irritated me is that I think this used to work. I do
> > recall testing suspend/resume a few years ago and devices would get
> > properly runtime suspended/resumed.
> 
> Not true at all.
> 
> The PM core has always taken PM-runtime references on all devices pretty much
> since when PM-runtime was introduced.

You're right. I was finally able to find a toolchain that I could build
an old version of the kernel with. I tested system suspend/resume on the
v4.8 release, which is the first one that had the runtime PM changes as
well as the subsystem suspend/resume support wired up, and I can't see
the runtime PM callbacks invoked during system suspend/resume.

So I must be misremembering, or I'm confusing it with some other tests I
was running at the time.

> > I did some digging but couldn't
> > find anything that would have had an impact on this.
> > 
> > Given that this is completely opt-in feature, why are you categorically
> > NAK'ing this?
> 
> The general problem is that if any device has been touched by system-wide
> suspend code, it should not be subject to PM-runtime any more until the
> subsequent system-wide resume is able to undo whatever the suspend did.
> 
> Moreover, if a device is runtime-suspended, the system-wide suspend code
> may mishandle it, in general.  That's why PM-runtime suspend is not allowed
> during system-wide transitions at all.  And it has always been like that.

For this particular use-case the above should all be irrelevant. None of
the drivers involved here do anything special at system suspend, because
runtime suspend already puts the devices into the lowest possible power
state. Basically when these devices are put into runtime suspend, they
are completely turned off. The only exception is for things like HDMI
where the +5V pin remains powered, so that hotplug detection will work.

The runtime PM state of the devices involved is managed by the subsystem
system suspend/resume helpers in DRM/KMS. Basically those helpers turn
off all the devices in the composite device, which ultimately results in
their last runtime PM reference being released. So for system suspend
and resume, these devices aren't touched, other than maybe for the PM
core's internal book-keeping.

> For a specific platform you may be able to overcome these limitations if
> you are careful enough, but certainly they are there in general and surely
> you cannot prevent people from using your opt-in just because they think
> that they know what they are doing.

That's true. But the same thing is true for pretty much all other APIs.
People obviously have to make sure they know what they're doing, just
like they have to with any other API.

I suppose the documentation for this new function is currently lacking a
bit. Perhaps adding a big warning to this and listing the common
pitfalls would help people make the right call about whether or not they
can use this.

> > Is there some other alternative that I can look into?
> 
> First of all, ensure that the dpm_list ordering is what it should be on the
> system/platform in question.  That can be done with the help of device links.

I don't think we have device links for everything, but the deferred
probe code should take care of ordering the dpm_list correctly because
we do handle deferred probe properly in all cases.

Also, the dpm_list ordering isn't very critical in this case. If the
devices are allowed to runtime suspend during system sleep, the
subsystem sleep helper will put them into runtime suspend at the correct
time. This is propagated all the way through the display pipeline and
that order is ensured by the subsystem helpers.

> In addition, make sure that the devices needed to suspend other devices are
> suspended in the noirq phase of system-wide suspend and resumed in the
> noirq phase of system-wide resume.  Or at least all of the other devices
> need to be suspended before them and resumed after them.

We're fine on this front as well. We have run into such issues in the
past, but I don't think there are any such issue left at the moment. I
do have one pending fix for I2C suspend/resume which fixes an issue
where some pinmuxing changes needed to get the HDMI DDC channel to work
were not getting applied during resume.

That I2C issue is related to this, I think. What I'm seeing is that when
the system goes to sleep, the pinmux looses its programming at a
hardware level, but the I2C driver doesn't know about it because it does
not get runtime suspended. At runtime suspend it would switch the pinmux
state to "idle" which would then match the system suspend state. Upon
runtime resume it sets the "default" pinmux state, which will then
restore the register programming.

In the current case where runtime suspend/resume is prohibited during
system sleep, upon resume the I2C driver will assume that the pinmux
state is still "default" and it won't reapply the state (it's actually
the pinmux subsystem that makes this decision) and causes HDMI DDC
transactions to time out.

One simple fix for that is to use pm_runtime_force_suspend() and
pm_runtime_force_resume() as system suspend/resume callbacks to make
sure the I2C controller is runtime suspended/resumed during system
sleep.

Note that forcing runtime suspend/resume this way is suboptimal in the
DRM/KMS case because the suspend/resume happens disconnected from the
subsystem suspend/resume callbacks, which is not desired as that breaks
some of the assumptions in those callbacks.

> These two things should allow you to cover the vast majority of cases if
> not all of them without messing up with the rules.

One alternative that I had thought about was to just ditch the runtime
PM callbacks for this. However, there's one corner case where this may
break. On early Tegra generations, the two display controllers are
"coupled" in that the second one doesn't work if the first one is
disabled. We describe that using a device link from the second to the
first controller. This causes the first controller to be automatically
be runtime resumed when the second controller is used. This only works
via runtime PM, so if I don't use runtime PM I'd have to add special
handling for that case.

Actually, there's another problem as well. Most of these devices use
generic PM domains to power on/off the SoC partitions that they're in.
If I side-step runtime PM, then I'd have to somehow find a way to
explicitly control the PM domains.

Another alternative would be to have a kind of hybrid approach where I
leave runtime PM calls in the drivers, but disconnect the runtime PM
callback implementations from that. That would at least fix the issue
with the generic PM domains.

However, it would not fix the problem with coupled display controllers
because empty runtime PM callbacks wouldn't actually power up the first
display controller when it is needed by the second controller. I would
have to add infrastructure that basically duplicates some of runtime PM
to fix that.

So the bottom line is that runtime PM is still the best solution for
this problem. It works really nice and is very consistent.

Do you think adding better documentation to this new flag and the
accessors would help remove your concerns about this?

Thierry