On Mon, Jan 28, 2019 at 12:26:48PM +0100, Philipp Zabel wrote:
> Hi Thierry,
> 
> On Fri, 2019-01-25 at 11:15 +0100, Thierry Reding wrote:
> > From: Thierry Reding <treding@nvidia.com>
> > 
> > When requesting a reset control for exclusive use that's already in use,
> > an -EBUSY error code is returned. Users can react accordingly when they
> > receive that error code, so there is no need to loudly complain.
> > 
> > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > ---
> >  drivers/reset/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/reset/core.c b/drivers/reset/core.c
> > index 9582efb70025..6b452f010b66 100644
> > --- a/drivers/reset/core.c
> > +++ b/drivers/reset/core.c
> > @@ -416,7 +416,7 @@ static struct reset_control *__reset_control_get_internal(
> >  
> >  	list_for_each_entry(rstc, &rcdev->reset_control_head, list) {
> >  		if (rstc->id == index) {
> > -			if (WARN_ON(!rstc->shared || !shared))
> > +			if (!rstc->shared || !shared)
> >  				return ERR_PTR(-EBUSY);
> >  
> >  			kref_get(&rstc->refcnt);
> 
> Are you actually running into this somewhere?

Yeah. I'm running into this on Tegra. Let me give you a brief overview
of how the resets work for better understanding. Most of the modules on
Tegra have dedicated resets; some may have multiple ones to reset a
subset of the hardware within a functional block. Typically the drivers
for these hardware modules will control the reset explicitly, usually to
make sure the hardware is in a known state at probe time. Some drivers
also do this as part of runtime suspend/resume.

Unfortunately we ran into issues when shared resets were introduced
because we also have generic power domains implemented on most platforms
and part of the programming sequence for power domains is to make sure a
hardware module that is part of the domain has its reset asserted. So on
one hand we need the reset for explicit control in the driver and on the
other hand we need the reset control in the power domain driver to make
sure the sequence requirements can be met.

This causes issues on chips such as Tegra210, where for example we need
to reset the SOR (which is an output used to drive HDMI) before setting
a mode. The bootloader may have initialized the SOR, so the reset makes
sure we start from a known state and that the rather complicated
sequence for bringing up the SOR can be completed.

The power domain for the SOR needs to control the reset for the SOR for
power sequencing reasons and at the same time, the SOR driver needs to
control the reset to get it into a proper state. In the past we were
able to make this work by requesting the reset control in the SOR driver
only if no power domain was attached to the SOR. This would avoid the
shared usage between power domain and SOR driver. We obviously can't
share the reset control because it wouldn't allow us to reset at the
right time.

On Tegra210 this works fine because the SOR power domain will be powered
off sometime during boot and in the process reset the SOR. This is
because all devices that are part of the domain are runtime power
managed and have a zero runtime PM refcount after ->probe(), so the PM
domain will be allowed to be turned off at that point. This likely works
only by accident, and would break if the driver probe order changed for
some reason.

The problem is more acute on Tegra186 where the SOR is in a much larger
power domain that includes a bunch of other modules. Unfortunately, one
side-effect of that is that the power domain will not be turned off
during boot. I saw this happen when I added HDA support. The HDA module
is also part of the same power domain as SOR and HDA keeps a runtime PM
reference all the time. This is because it isn't runtime PM managed at
this point, but even if it was, there are so many modules in the power
domain that we can't guarantee that the power domain will at some point
be powered off and the SOR reset at that time.

Fortunately the power domains on Tegra186 (and later) are no longer
controlled by a Linux driver. Instead, there's now a firmware running on
a microcontroller (called boot and power management processor, or BPMP)
that Linux communicates with over an IPC mechanism. BPMP will internally
control the resets as appropriate, which means that we can exclusively
reset them again from Linux and explicitly reset the SOR when needed.

In order to support both Tegra210 and Tegra186 in the same driver, we no
longer check whether a power domain was attached to the SOR. Instead we
simply try to get the reset control, and if we get -EBUSY we assume that
we run on Tegra210 or earlier and don't have to use the reset (since
the power domain will be able to reset the SOR). For later chips we get
access to the reset control because Linux doesn't know that BPMP also
has access to it.
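
To make that fallback a bit more concrete, here is a rough userspace
model in C. It is only a sketch: the model_* names, the fixed-size table
and the return-code convention are made up for illustration and are not
the kernel's reset API. The lookup mirrors the check from the patch
(without the WARN_ON), and model_sor_get_reset() shows a consumer that
treats -EBUSY as "someone else, presumably the power domain provider,
already owns this reset".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel reset API. */
#define MODEL_MAX_CONTROLS 8

struct model_reset_control {
	unsigned int id;
	bool shared;
	unsigned int refcnt;	/* 0 means the slot is free */
};

struct model_reset_controller {
	struct model_reset_control controls[MODEL_MAX_CONTROLS];
};

/*
 * Look up or create the control for @id. A second request succeeds only
 * if both the existing user and the new one asked for shared access;
 * otherwise -EBUSY is returned quietly.
 */
static int model_reset_get(struct model_reset_controller *rcdev,
			   unsigned int id, bool shared,
			   struct model_reset_control **out)
{
	struct model_reset_control *free_slot = NULL;
	size_t i;

	assert(rcdev && out);
	*out = NULL;

	for (i = 0; i < MODEL_MAX_CONTROLS; i++) {
		struct model_reset_control *rstc = &rcdev->controls[i];

		if (rstc->refcnt == 0) {
			/* Remember the first free slot for a new control. */
			if (!free_slot)
				free_slot = rstc;
			continue;
		}

		if (rstc->id == id) {
			if (!rstc->shared || !shared)
				return -EBUSY;

			rstc->refcnt++;
			*out = rstc;
			return 0;
		}
	}

	if (!free_slot)
		return -ENOMEM;

	free_slot->id = id;
	free_slot->shared = shared;
	free_slot->refcnt = 1;
	*out = free_slot;
	return 0;
}

/*
 * Consumer side: request the reset exclusively and treat -EBUSY as
 * "another owner handles the reset for us". *out stays NULL then.
 */
static int model_sor_get_reset(struct model_reset_controller *rcdev,
			       unsigned int id,
			       struct model_reset_control **out)
{
	int err = model_reset_get(rcdev, id, false, out);

	if (err == -EBUSY)
		return 0;

	return err;
}
```

If the power domain provider already holds the reset exclusively,
model_sor_get_reset() returns 0 with a NULL control, and the caller
would skip the explicit reset. If nobody else holds it (the Tegra186
case as Linux sees it), the caller gets the control and can reset the
SOR itself.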

So much for the "brief" overview... =)

Now, after writing the above, I'm not sure if this is really the best
approach. Technically we could run into the same problem on Tegra210
where we can't explicitly control the reset because Linux already has
it marked as exclusively used by the power domain provider.

But then, I don't know if there's an alternative to just crossing our
fingers and hoping that things will continue to work "by accident". For
some devices it may not matter because they are less picky about their
current state when you start programming them, but the SOR is very
sensitive and I've never been able to make it work properly without
involving the help of the reset at some point.

One alternative that could work for Tegra would be to somehow mark the
resets as being safe to use multiple times. In our use-cases we always
know that it is safe to reset the SOR because the power domains will be
on at the time that we want to control it, so there won't be any
conflicts. However, I suspect that that could set a precedent for other
drivers.

> My reason for adding these warnings was that these point to either a DT
> misconfiguration or a driver bug, and the verbose warning helps to
> quickly identify the actual issue. This is not an error condition that
> I would expect on a correctly configured system.

I generally agree with this approach, but given the above, I'm out of
ideas of how to properly achieve what we need on Tegra. Even this patch
seems like a bad idea in retrospect because we may very well run into a
similar issue on Tegra210 where we can't hide the fact that we're using
the reset from two places at the same time. I mean we could probably do
a really ugly hack and access the reset controller registers directly
from the power domain provider, but that's really not something I want
to do. We already have to do something similar to work around a hardware
bug on Tegra210, but that's about as much as I can endure.

> I don't expect most drivers give a proper error message that contains
> the -EBUSY return value. Usually it's just along the lines of "failed to
> get reset control" without any further indication.

I understand your reluctance. And I'm certainly open to any new ideas
that I haven't tried yet. I haven't tried reintroducing some sort of
non-shared resets, but I've never seriously considered it because it
effectively undoes most of the shared reset stuff. If you think that
some form of this would be acceptable, I will be happy to come up with a
proposal. Or perhaps there's a really simple solution to this problem
and I simply can't see the wood for the trees anymore.

Thierry