Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

From: Alex Deucher <alexdeucher@gmail.com>
To: Daniel Drake <drake@endlessm.com>
Cc: "Deucher, Alexander" <alexander.deucher@amd.com>,
	Christian Koenig <christian.koenig@amd.com>,
	Chunming Zhou <David1.Zhou@amd.com>,
	amd-gfx list <amd-gfx@lists.freedesktop.org>,
	Linux PM <linux-pm@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
Date: Thu, 16 Jan 2020 10:14:49 -0500	[thread overview]
Message-ID: <CADnq5_PNGr4=MqpBeKbhxJ-gpniSCj7L0wO5_V6mjuwpKoaCAg@mail.gmail.com> (raw)
In-Reply-To: <CAD8Lp44FKuEsmdK+zDX_-ZYQEnqjQM-z6nnfE-CJ62mutd+scA@mail.gmail.com>

On Wed, Jan 15, 2020 at 2:44 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > I think there may be some AMD specific handling needed in
> > drivers/acpi/sleep.c.  My understanding from reading the modern
> > standby documents from MS is that each vendor needs to provide a
> > platform specific PEP driver.  I'm not sure how much of that current
> > code is Intel specific or not.
>
> I don't think there is anything Intel-specific in drivers/acpi/sleep.c.
>
> Reading more about PEP, I see that Linux supports PEP devices with
> ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
> have INT33A1 devices in their ACPI tables.
>
> This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
> PNP0D80. Full acpidump:
> https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
> ssdt7)
>
> This PEP device responds to a _DSM with UUID argument
> "e3f32452-febc-43ce-9039-932122d37721", which is not the one
> documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
>
> Nevertheless, there is some data about the GPU:
>                     Package (0x04)
>                     {
>                         One,
>                         "\\_SB.PCI0.GP17.VGA",
>                         Zero,
>                         0x03
>                     },
>
> However since this data is identical to many other devices that
> suspend and resume just fine, I wonder if it is really important.
>
> The one supported method does offer two calls which may mirror the
> Display Off/On Notifications in the above spec:
>                         Case (0x02)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
>                             Return (Zero)
>                         }
>                         Case (0x03)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
>                             Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
> Status Change
>                             Return (Zero)
>                         }
>
> but I tried executing this code after suspending amdgpu, and the
> problem still stands, amdgpu cannot wakeup correctly.
>
> There's nothing else really interesting in the PEP device as far as I can see.
>
> PEP things aside, I am still quite suspicious about the fact that
> calling amdgpu_device_suspend() then amdgpu_device_resume() on
> multiple products (not just this one) fails. It seems that this code
> flow is relying on the BIOS doing something in the S3 suspend/resume
> path in order to make the device resumable by amdgpu_device_resume(),
> which is why we have only encountered this issue for the first time on
> our first AMD platform that does not support S3 suspend.
>

It makes sense.  This is an SOC, not a collection of individual
devices.  There are more devices than power rails so the sbios (via
ACPI) has to handle the power rail.  All of the devices using that
power rail have to suspend and tell the sbios before the platform
microcontroller can turn off the power rail.  Presumably there is
something missing that prevents the microcontroller for powering down
the rail.  If the power rail is kept on, the device is never powered
off and still retains its current state.

> With that in mind, and lacking any better info, wouldn't it make sense
> for amdgpu_device_resume() to always call reset? Maybe it's not
> necessary in the S3 case, but it shouldn't harm anything. Or perhaps
> it could check if the device is alive and reset it if it's not?

It's just papering over the problem.  It would be better from a power
perspective for the driver to just not suspend and keep running like
normal.  When the driver is not suspended runtime things like clock
and power gating are active which keep the GPU power at a minimum.

>
> Alternatively do you have any other contacts within AMD that could
> help us figure out the underlying question of how to correctly suspend
> and resume these devices? Happy to ship an affected product sample
> your way.
>

I talked to our sbios team and they seem to think our S0ix
implementation works pretty differently from Intel's.  I'm not really
an expert on this area however.  We have a new team ramping on up this
for Linux however.

Alex