[PATCH] drm/amdgpu: always reset asic when going into suspend

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-10-15  6:50 Daniel Drake
       [not found] ` <20191015065002.18701-1-drake-6IF/jdPJHihWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Drake @ 2019-10-15  6:50 UTC (permalink / raw)
  To: alexander.deucher-5C7GfCeVMHo, christian.koenig-5C7GfCeVMHo,
	David1.Zhou-5C7GfCeVMHo
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Asus UX434DA (Ryzen7 3700U), upon resume from s2idle, the screen
turns on again and shows the pre-suspend image, but the display remains
frozen from that point onwards.

The kernel logs show errors:

 [drm] psp command failed and response status is (0x7)
 [drm] Fence fallback timer expired on ring sdma0
 [drm] Fence fallback timer expired on ring gfx
 amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests] *ERROR* IB test failed on gfx (-22).
 [drm:process_one_work] *ERROR* ib ring test failed (-22).

This can also be reproduced with pm_test:
 # echo devices > /sys/power/pm_test
 # echo freeze > /sys/power/mem

The same reproducer causes the same problem on Asus X512DK (Ryzen5 3500U)
even though that model is normally able to suspend and resume OK via S3.

Experimenting, I observed that this error condition can be invoked on
any amdgpu product by executing in succession:

  amdgpu_device_suspend(drm_dev, true, true);
  amdgpu_device_resume(drm_dev, true, true);

i.e. it appears that the resume routine is unable to get the device out
of suspended state, except for the S3 suspend case where it presumably has
a bit of extra help from the firmware or hardware.

However, I also observed that the runtime suspend/resume routines work
OK when tested like this, which lead me to the key difference in these
two cases: the ASIC reset, which only happens in the runtime suspend path.

Since it takes less than 1ms, we should do the ASIC reset in all
suspend paths, fixing resume from s2idle on these products.

Link: https://bugs.freedesktop.org/show_bug.cgi?id=111811
Signed-off-by: Daniel Drake <drake@endlessm.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a1939dbd4e3..7f4870e974fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3082,15 +3082,16 @@ int amdgpu_device_suspend(struct drm_device *dev, bool suspend, bool fbcon)
 	 */
 	amdgpu_bo_evict_vram(adev);

+	amdgpu_asic_reset(adev);
+	r = amdgpu_asic_reset(adev);
+	if (r)
+		DRM_ERROR("amdgpu asic reset failed\n");
+
 	pci_save_state(dev->pdev);
 	if (suspend) {
 		/* Shut down the device */
 		pci_disable_device(dev->pdev);
 		pci_set_power_state(dev->pdev, PCI_D3hot);
-	} else {
-		r = amdgpu_asic_reset(adev);
-		if (r)
-			DRM_ERROR("amdgpu asic reset failed\n");
 	}

 	return 0;
-- 
2.20.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
       [not found] ` <20191015065002.18701-1-drake-6IF/jdPJHihWk0Htik3J/w@public.gmane.org>
@ 2019-10-15 18:42   ` Alex Deucher
  2019-10-16  7:23       ` Daniel Drake
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Deucher @ 2019-10-15 18:42 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig, amd-gfx list

On Tue, Oct 15, 2019 at 2:50 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Asus UX434DA (Ryzen7 3700U), upon resume from s2idle, the screen
> turns on again and shows the pre-suspend image, but the display remains
> frozen from that point onwards.
>
> The kernel logs show errors:
>
>  [drm] psp command failed and response status is (0x7)
>  [drm] Fence fallback timer expired on ring sdma0
>  [drm] Fence fallback timer expired on ring gfx
>  amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests] *ERROR* IB test failed on gfx (-22).
>  [drm:process_one_work] *ERROR* ib ring test failed (-22).
>
> This can also be reproduced with pm_test:
>  # echo devices > /sys/power/pm_test
>  # echo freeze > /sys/power/mem
>
> The same reproducer causes the same problem on Asus X512DK (Ryzen5 3500U)
> even though that model is normally able to suspend and resume OK via S3.
>
> Experimenting, I observed that this error condition can be invoked on
> any amdgpu product by executing in succession:
>
>   amdgpu_device_suspend(drm_dev, true, true);
>   amdgpu_device_resume(drm_dev, true, true);
>
> i.e. it appears that the resume routine is unable to get the device out
> of suspended state, except for the S3 suspend case where it presumably has
> a bit of extra help from the firmware or hardware.
>
> However, I also observed that the runtime suspend/resume routines work
> OK when tested like this, which lead me to the key difference in these
> two cases: the ASIC reset, which only happens in the runtime suspend path.
>
> Since it takes less than 1ms, we should do the ASIC reset in all
> suspend paths, fixing resume from s2idle on these products.
>

Is s2idle actually powering down the GPU?  Do you see a difference in
power usage?  I think you are just working around the fact that the
GPU never actually gets powered down.  Leaving the GPU in the reset
state probably uses more power than not suspending it in the first
place.

Alex

> Link: https://bugs.freedesktop.org/show_bug.cgi?id=111811
> Signed-off-by: Daniel Drake <drake@endlessm.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a1939dbd4e3..7f4870e974fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3082,15 +3082,16 @@ int amdgpu_device_suspend(struct drm_device *dev, bool suspend, bool fbcon)
>          */
>         amdgpu_bo_evict_vram(adev);
>
> +       amdgpu_asic_reset(adev);
> +       r = amdgpu_asic_reset(adev);
> +       if (r)
> +               DRM_ERROR("amdgpu asic reset failed\n");
> +
>         pci_save_state(dev->pdev);
>         if (suspend) {
>                 /* Shut down the device */
>                 pci_disable_device(dev->pdev);
>                 pci_set_power_state(dev->pdev, PCI_D3hot);
> -       } else {
> -               r = amdgpu_asic_reset(adev);
> -               if (r)
> -                       DRM_ERROR("amdgpu asic reset failed\n");
>         }
>
>         return 0;
> --
> 2.20.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-10-16  7:23       ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-10-16  7:23 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> Is s2idle actually powering down the GPU?

My understanding is that s2idle (at a high level) just calls all
devices suspend routines and then puts the CPU into its deepest
running state.
So if there is something special to be done to power off the GPU, I
believe that amdgpu is responsible for making arrangements for that to
happen.
In this case the amdgpu code already does:

        pci_disable_device(dev->pdev);
        pci_set_power_state(dev->pdev, PCI_D3hot);

And the PCI layer will call through to any appropriate ACPI methods
related to that low power state.

> Do you see a difference in power usage?  I think you are just working around the fact that the
> GPU never actually gets powered down.

I ran a series of experiments.

Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
1" and waited 1 minute for screen to turn off.
Base power usage in this state is 4.7W as reported by BAT0/power_now

1. Run amdgpu_device_suspend(ddev, true, true); before my change
--> Power usage increases to 6.1W

2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
--> Power usage increases to 6.0W

3. Put amdgpu device in runtime suspend
--> Power usage increases to 6.2W

4. Try unmodified suspend path but d3cold instead of d3hot
--> Power usage increases to 6.1W

So, all of the suspend schemes actually increase the power usage by
roughly the same amount, reset or not, with and without my patch :/
Any ideas?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-10-16  7:23       ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-10-16  7:23 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> Is s2idle actually powering down the GPU?

My understanding is that s2idle (at a high level) just calls all
devices suspend routines and then puts the CPU into its deepest
running state.
So if there is something special to be done to power off the GPU, I
believe that amdgpu is responsible for making arrangements for that to
happen.
In this case the amdgpu code already does:

        pci_disable_device(dev->pdev);
        pci_set_power_state(dev->pdev, PCI_D3hot);

And the PCI layer will call through to any appropriate ACPI methods
related to that low power state.

> Do you see a difference in power usage?  I think you are just working around the fact that the
> GPU never actually gets powered down.

I ran a series of experiments.

Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
1" and waited 1 minute for screen to turn off.
Base power usage in this state is 4.7W as reported by BAT0/power_now

1. Run amdgpu_device_suspend(ddev, true, true); before my change
--> Power usage increases to 6.1W

2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
--> Power usage increases to 6.0W

3. Put amdgpu device in runtime suspend
--> Power usage increases to 6.2W

4. Try unmodified suspend path but d3cold instead of d3hot
--> Power usage increases to 6.1W

So, all of the suspend schemes actually increase the power usage by
roughly the same amount, reset or not, with and without my patch :/
Any ideas?

Thanks,
Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-22 15:31         ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2019-11-22 15:31 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Wed, Oct 16, 2019 at 3:24 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > Is s2idle actually powering down the GPU?
>
> My understanding is that s2idle (at a high level) just calls all
> devices suspend routines and then puts the CPU into its deepest
> running state.
> So if there is something special to be done to power off the GPU, I
> believe that amdgpu is responsible for making arrangements for that to
> happen.
> In this case the amdgpu code already does:
>
>         pci_disable_device(dev->pdev);
>         pci_set_power_state(dev->pdev, PCI_D3hot);
>
> And the PCI layer will call through to any appropriate ACPI methods
> related to that low power state.
>
> > Do you see a difference in power usage?  I think you are just working around the fact that the
> > GPU never actually gets powered down.
>
> I ran a series of experiments.
>
> Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
> 1" and waited 1 minute for screen to turn off.
> Base power usage in this state is 4.7W as reported by BAT0/power_now
>
> 1. Run amdgpu_device_suspend(ddev, true, true); before my change
> --> Power usage increases to 6.1W
>
> 2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
> --> Power usage increases to 6.0W
>
> 3. Put amdgpu device in runtime suspend
> --> Power usage increases to 6.2W
>
> 4. Try unmodified suspend path but d3cold instead of d3hot
> --> Power usage increases to 6.1W
>
> So, all of the suspend schemes actually increase the power usage by
> roughly the same amount, reset or not, with and without my patch :/
> Any ideas?

Do these patches help?
https://patchwork.freedesktop.org/patch/341775/
https://patchwork.freedesktop.org/patch/341968/

Alex

>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-22 15:31         ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2019-11-22 15:31 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Wed, Oct 16, 2019 at 3:24 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > Is s2idle actually powering down the GPU?
>
> My understanding is that s2idle (at a high level) just calls all
> devices suspend routines and then puts the CPU into its deepest
> running state.
> So if there is something special to be done to power off the GPU, I
> believe that amdgpu is responsible for making arrangements for that to
> happen.
> In this case the amdgpu code already does:
>
>         pci_disable_device(dev->pdev);
>         pci_set_power_state(dev->pdev, PCI_D3hot);
>
> And the PCI layer will call through to any appropriate ACPI methods
> related to that low power state.
>
> > Do you see a difference in power usage?  I think you are just working around the fact that the
> > GPU never actually gets powered down.
>
> I ran a series of experiments.
>
> Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
> 1" and waited 1 minute for screen to turn off.
> Base power usage in this state is 4.7W as reported by BAT0/power_now
>
> 1. Run amdgpu_device_suspend(ddev, true, true); before my change
> --> Power usage increases to 6.1W
>
> 2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
> --> Power usage increases to 6.0W
>
> 3. Put amdgpu device in runtime suspend
> --> Power usage increases to 6.2W
>
> 4. Try unmodified suspend path but d3cold instead of d3hot
> --> Power usage increases to 6.1W
>
> So, all of the suspend schemes actually increase the power usage by
> roughly the same amount, reset or not, with and without my patch :/
> Any ideas?

Do these patches help?
https://patchwork.freedesktop.org/patch/341775/
https://patchwork.freedesktop.org/patch/341968/

Alex

>
> Thanks,
> Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-22 15:31         ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2019-11-22 15:31 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Wed, Oct 16, 2019 at 3:24 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > Is s2idle actually powering down the GPU?
>
> My understanding is that s2idle (at a high level) just calls all
> devices suspend routines and then puts the CPU into its deepest
> running state.
> So if there is something special to be done to power off the GPU, I
> believe that amdgpu is responsible for making arrangements for that to
> happen.
> In this case the amdgpu code already does:
>
>         pci_disable_device(dev->pdev);
>         pci_set_power_state(dev->pdev, PCI_D3hot);
>
> And the PCI layer will call through to any appropriate ACPI methods
> related to that low power state.
>
> > Do you see a difference in power usage?  I think you are just working around the fact that the
> > GPU never actually gets powered down.
>
> I ran a series of experiments.
>
> Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
> 1" and waited 1 minute for screen to turn off.
> Base power usage in this state is 4.7W as reported by BAT0/power_now
>
> 1. Run amdgpu_device_suspend(ddev, true, true); before my change
> --> Power usage increases to 6.1W
>
> 2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
> --> Power usage increases to 6.0W
>
> 3. Put amdgpu device in runtime suspend
> --> Power usage increases to 6.2W
>
> 4. Try unmodified suspend path but d3cold instead of d3hot
> --> Power usage increases to 6.1W
>
> So, all of the suspend schemes actually increase the power usage by
> roughly the same amount, reset or not, with and without my patch :/
> Any ideas?

Do these patches help?
https://patchwork.freedesktop.org/patch/341775/
https://patchwork.freedesktop.org/patch/341968/

Alex

>
> Thanks,
> Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-25  5:17           ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-11-25  5:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Fri, Nov 22, 2019 at 11:32 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> Do these patches help?
> https://patchwork.freedesktop.org/patch/341775/
> https://patchwork.freedesktop.org/patch/341968/

Unfortunately not. The original issue still exists (dead gfx after
resume from s2idle) and also when I trigger execution of the suspend
or runtime suspend routines the power usage increases around 1.5W as
before.

Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Thanks
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-25  5:17           ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-11-25  5:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Fri, Nov 22, 2019 at 11:32 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> Do these patches help?
> https://patchwork.freedesktop.org/patch/341775/
> https://patchwork.freedesktop.org/patch/341968/

Unfortunately not. The original issue still exists (dead gfx after
resume from s2idle) and also when I trigger execution of the suspend
or runtime suspend routines the power usage increases around 1.5W as
before.

Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Thanks
Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-11-25  5:17           ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-11-25  5:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Fri, Nov 22, 2019 at 11:32 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> Do these patches help?
> https://patchwork.freedesktop.org/patch/341775/
> https://patchwork.freedesktop.org/patch/341968/

Unfortunately not. The original issue still exists (dead gfx after
resume from s2idle) and also when I trigger execution of the suspend
or runtime suspend routines the power usage increases around 1.5W as
before.

Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Thanks
Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2019-11-25  5:17           ` Daniel Drake
@ 2019-12-16  9:00             ` Daniel Drake
  -1 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-12-16  9:00 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

Hi Alex,

On Mon, Nov 25, 2019 at 1:17 PM Daniel Drake <drake@endlessm.com> wrote:
> Unfortunately not. The original issue still exists (dead gfx after
> resume from s2idle) and also when I trigger execution of the suspend
> or runtime suspend routines the power usage increases around 1.5W as
> before.
>
> Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Any further ideas here? Or any workarounds that you would consider?

This platform has been rather tricky but all of the other problems are
now solved:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f897e60a12f0b9146357780d317879bce2a877dc
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21b8adbd475dba19ac2086d3306327b4a297418
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=406857f773b082bc88edfd24967facf4ed07ac85
https://patchwork.kernel.org/patch/11263477/

amdgpu is the only breakage left before Linux can be shipped on this
family of products.

Thanks
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-12-16  9:00             ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2019-12-16  9:00 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

Hi Alex,

On Mon, Nov 25, 2019 at 1:17 PM Daniel Drake <drake@endlessm.com> wrote:
> Unfortunately not. The original issue still exists (dead gfx after
> resume from s2idle) and also when I trigger execution of the suspend
> or runtime suspend routines the power usage increases around 1.5W as
> before.
>
> Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Any further ideas here? Or any workarounds that you would consider?

This platform has been rather tricky but all of the other problems are
now solved:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f897e60a12f0b9146357780d317879bce2a877dc
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21b8adbd475dba19ac2086d3306327b4a297418
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=406857f773b082bc88edfd24967facf4ed07ac85
https://patchwork.kernel.org/patch/11263477/

amdgpu is the only breakage left before Linux can be shipped on this
family of products.

Thanks
Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2019-12-16  9:00             ` Daniel Drake
@ 2019-12-19 14:08               ` Alex Deucher
  -1 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2019-12-19 14:08 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Mon, Dec 16, 2019 at 4:00 AM Daniel Drake <drake@endlessm.com> wrote:
>
> Hi Alex,
>
> On Mon, Nov 25, 2019 at 1:17 PM Daniel Drake <drake@endlessm.com> wrote:
> > Unfortunately not. The original issue still exists (dead gfx after
> > resume from s2idle) and also when I trigger execution of the suspend
> > or runtime suspend routines the power usage increases around 1.5W as
> > before.
> >
> > Have you confirmed that amdgpu s2idle is working on platforms you have in hand?
>
> Any further ideas here? Or any workarounds that you would consider?

I think there may be some AMD specific handling needed in
drivers/acpi/sleep.c.  My understanding from reading the modern
standby documents from MS is that each vendor needs to provide a
platform specific PEP driver.  I'm not sure how much of that current
code is Intel specific or not.

Alex

>
> This platform has been rather tricky but all of the other problems are
> now solved:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f897e60a12f0b9146357780d317879bce2a877dc
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21b8adbd475dba19ac2086d3306327b4a297418
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=406857f773b082bc88edfd24967facf4ed07ac85
> https://patchwork.kernel.org/patch/11263477/
>
> amdgpu is the only breakage left before Linux can be shipped on this
> family of products.
>
> Thanks
> Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2019-12-19 14:08               ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2019-12-19 14:08 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Mon, Dec 16, 2019 at 4:00 AM Daniel Drake <drake@endlessm.com> wrote:
>
> Hi Alex,
>
> On Mon, Nov 25, 2019 at 1:17 PM Daniel Drake <drake@endlessm.com> wrote:
> > Unfortunately not. The original issue still exists (dead gfx after
> > resume from s2idle) and also when I trigger execution of the suspend
> > or runtime suspend routines the power usage increases around 1.5W as
> > before.
> >
> > Have you confirmed that amdgpu s2idle is working on platforms you have in hand?
>
> Any further ideas here? Or any workarounds that you would consider?

I think there may be some AMD specific handling needed in
drivers/acpi/sleep.c.  My understanding from reading the modern
standby documents from MS is that each vendor needs to provide a
platform specific PEP driver.  I'm not sure how much of that current
code is Intel specific or not.

Alex

>
> This platform has been rather tricky but all of the other problems are
> now solved:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f897e60a12f0b9146357780d317879bce2a877dc
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21b8adbd475dba19ac2086d3306327b4a297418
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=406857f773b082bc88edfd24967facf4ed07ac85
> https://patchwork.kernel.org/patch/11263477/
>
> amdgpu is the only breakage left before Linux can be shipped on this
> family of products.
>
> Thanks
> Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2019-12-19 14:08               ` Alex Deucher
@ 2020-01-15  7:44                 ` Daniel Drake
  -1 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2020-01-15  7:44 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> I think there may be some AMD specific handling needed in
> drivers/acpi/sleep.c.  My understanding from reading the modern
> standby documents from MS is that each vendor needs to provide a
> platform specific PEP driver.  I'm not sure how much of that current
> code is Intel specific or not.

I don't think there is anything Intel-specific in drivers/acpi/sleep.c.

Reading more about PEP, I see that Linux supports PEP devices with
ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
have INT33A1 devices in their ACPI tables.

This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
PNP0D80. Full acpidump:
https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
ssdt7)

This PEP device responds to a _DSM with UUID argument
"e3f32452-febc-43ce-9039-932122d37721", which is not the one
documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf

Nevertheless, there is some data about the GPU:
                    Package (0x04)
                    {
                        One,
                        "\\_SB.PCI0.GP17.VGA",
                        Zero,
                        0x03
                    },

However since this data is identical to many other devices that
suspend and resume just fine, I wonder if it is really important.

The one supported method does offer two calls which may mirror the
Display Off/On Notifications in the above spec:
                        Case (0x02)
                        {
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
                            Return (Zero)
                        }
                        Case (0x03)
                        {
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
                            Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
Status Change
                            Return (Zero)
                        }

but I tried executing this code after suspending amdgpu, and the
problem still stands, amdgpu cannot wakeup correctly.

There's nothing else really interesting in the PEP device as far as I can see.

PEP things aside, I am still quite suspicious about the fact that
calling amdgpu_device_suspend() then amdgpu_device_resume() on
multiple products (not just this one) fails. It seems that this code
flow is relying on the BIOS doing something in the S3 suspend/resume
path in order to make the device resumable by amdgpu_device_resume(),
which is why we have only encountered this issue for the first time on
our first AMD platform that does not support S3 suspend.

With that in mind, and lacking any better info, wouldn't it make sense
for amdgpu_device_resume() to always call reset? Maybe it's not
necessary in the S3 case, but it shouldn't harm anything. Or perhaps
it could check if the device is alive and reset it if it's not?

Alternatively do you have any other contacts within AMD that could
help us figure out the underlying question of how to correctly suspend
and resume these devices? Happy to ship an affected product sample
your way.

Thanks
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2020-01-15  7:44                 ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2020-01-15  7:44 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> I think there may be some AMD specific handling needed in
> drivers/acpi/sleep.c.  My understanding from reading the modern
> standby documents from MS is that each vendor needs to provide a
> platform specific PEP driver.  I'm not sure how much of that current
> code is Intel specific or not.

I don't think there is anything Intel-specific in drivers/acpi/sleep.c.

Reading more about PEP, I see that Linux supports PEP devices with
ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
have INT33A1 devices in their ACPI tables.

This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
PNP0D80. Full acpidump:
https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
ssdt7)

This PEP device responds to a _DSM with UUID argument
"e3f32452-febc-43ce-9039-932122d37721", which is not the one
documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf

Nevertheless, there is some data about the GPU:
                    Package (0x04)
                    {
                        One,
                        "\\_SB.PCI0.GP17.VGA",
                        Zero,
                        0x03
                    },

However since this data is identical to many other devices that
suspend and resume just fine, I wonder if it is really important.

The one supported method does offer two calls which may mirror the
Display Off/On Notifications in the above spec:
                        Case (0x02)
                        {
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
                            Return (Zero)
                        }
                        Case (0x03)
                        {
                            \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
                            Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
Status Change
                            Return (Zero)
                        }

but I tried executing this code after suspending amdgpu, and the
problem still stands, amdgpu cannot wakeup correctly.

There's nothing else really interesting in the PEP device as far as I can see.

PEP things aside, I am still quite suspicious about the fact that
calling amdgpu_device_suspend() then amdgpu_device_resume() on
multiple products (not just this one) fails. It seems that this code
flow is relying on the BIOS doing something in the S3 suspend/resume
path in order to make the device resumable by amdgpu_device_resume(),
which is why we have only encountered this issue for the first time on
our first AMD platform that does not support S3 suspend.

With that in mind, and lacking any better info, wouldn't it make sense
for amdgpu_device_resume() to always call reset? Maybe it's not
necessary in the S3 case, but it shouldn't harm anything. Or perhaps
it could check if the device is alive and reset it if it's not?

Alternatively do you have any other contacts within AMD that could
help us figure out the underlying question of how to correctly suspend
and resume these devices? Happy to ship an affected product sample
your way.

Thanks
Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2020-01-15  7:44                 ` Daniel Drake
@ 2020-01-16 15:14                   ` Alex Deucher
  -1 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2020-01-16 15:14 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Wed, Jan 15, 2020 at 2:44 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > I think there may be some AMD specific handling needed in
> > drivers/acpi/sleep.c.  My understanding from reading the modern
> > standby documents from MS is that each vendor needs to provide a
> > platform specific PEP driver.  I'm not sure how much of that current
> > code is Intel specific or not.
>
> I don't think there is anything Intel-specific in drivers/acpi/sleep.c.
>
> Reading more about PEP, I see that Linux supports PEP devices with
> ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
> have INT33A1 devices in their ACPI tables.
>
> This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
> PNP0D80. Full acpidump:
> https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
> ssdt7)
>
> This PEP device responds to a _DSM with UUID argument
> "e3f32452-febc-43ce-9039-932122d37721", which is not the one
> documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
>
> Nevertheless, there is some data about the GPU:
>                     Package (0x04)
>                     {
>                         One,
>                         "\\_SB.PCI0.GP17.VGA",
>                         Zero,
>                         0x03
>                     },
>
> However since this data is identical to many other devices that
> suspend and resume just fine, I wonder if it is really important.
>
> The one supported method does offer two calls which may mirror the
> Display Off/On Notifications in the above spec:
>                         Case (0x02)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
>                             Return (Zero)
>                         }
>                         Case (0x03)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
>                             Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
> Status Change
>                             Return (Zero)
>                         }
>
> but I tried executing this code after suspending amdgpu, and the
> problem still stands, amdgpu cannot wakeup correctly.
>
> There's nothing else really interesting in the PEP device as far as I can see.
>
> PEP things aside, I am still quite suspicious about the fact that
> calling amdgpu_device_suspend() then amdgpu_device_resume() on
> multiple products (not just this one) fails. It seems that this code
> flow is relying on the BIOS doing something in the S3 suspend/resume
> path in order to make the device resumable by amdgpu_device_resume(),
> which is why we have only encountered this issue for the first time on
> our first AMD platform that does not support S3 suspend.
>

It makes sense.  This is an SOC, not a collection of individual
devices.  There are more devices than power rails so the sbios (via
ACPI) has to handle the power rail.  All of the devices using that
power rail have to suspend and tell the sbios before the platform
microcontroller can turn off the power rail.  Presumably there is
something missing that prevents the microcontroller for powering down
the rail.  If the power rail is kept on, the device is never powered
off and still retains its current state.

> With that in mind, and lacking any better info, wouldn't it make sense
> for amdgpu_device_resume() to always call reset? Maybe it's not
> necessary in the S3 case, but it shouldn't harm anything. Or perhaps
> it could check if the device is alive and reset it if it's not?

It's just papering over the problem.  It would be better from a power
perspective for the driver to just not suspend and keep running like
normal.  When the driver is not suspended runtime things like clock
and power gating are active which keep the GPU power at a minimum.

>
> Alternatively do you have any other contacts within AMD that could
> help us figure out the underlying question of how to correctly suspend
> and resume these devices? Happy to ship an affected product sample
> your way.
>

I talked to our sbios team and they seem to think our S0ix
implementation works pretty differently from Intel's.  I'm not really
an expert on this area however.  We have a new team ramping on up this
for Linux however.

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2020-01-16 15:14                   ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2020-01-16 15:14 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Wed, Jan 15, 2020 at 2:44 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > I think there may be some AMD specific handling needed in
> > drivers/acpi/sleep.c.  My understanding from reading the modern
> > standby documents from MS is that each vendor needs to provide a
> > platform specific PEP driver.  I'm not sure how much of that current
> > code is Intel specific or not.
>
> I don't think there is anything Intel-specific in drivers/acpi/sleep.c.
>
> Reading more about PEP, I see that Linux supports PEP devices with
> ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
> have INT33A1 devices in their ACPI tables.
>
> This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
> PNP0D80. Full acpidump:
> https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
> ssdt7)
>
> This PEP device responds to a _DSM with UUID argument
> "e3f32452-febc-43ce-9039-932122d37721", which is not the one
> documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
>
> Nevertheless, there is some data about the GPU:
>                     Package (0x04)
>                     {
>                         One,
>                         "\\_SB.PCI0.GP17.VGA",
>                         Zero,
>                         0x03
>                     },
>
> However since this data is identical to many other devices that
> suspend and resume just fine, I wonder if it is really important.
>
> The one supported method does offer two calls which may mirror the
> Display Off/On Notifications in the above spec:
>                         Case (0x02)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
>                             Return (Zero)
>                         }
>                         Case (0x03)
>                         {
>                             \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
>                             Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
> Status Change
>                             Return (Zero)
>                         }
>
> but I tried executing this code after suspending amdgpu, and the
> problem still stands, amdgpu cannot wakeup correctly.
>
> There's nothing else really interesting in the PEP device as far as I can see.
>
> PEP things aside, I am still quite suspicious about the fact that
> calling amdgpu_device_suspend() then amdgpu_device_resume() on
> multiple products (not just this one) fails. It seems that this code
> flow is relying on the BIOS doing something in the S3 suspend/resume
> path in order to make the device resumable by amdgpu_device_resume(),
> which is why we have only encountered this issue for the first time on
> our first AMD platform that does not support S3 suspend.
>

It makes sense.  This is an SOC, not a collection of individual
devices.  There are more devices than power rails so the sbios (via
ACPI) has to handle the power rail.  All of the devices using that
power rail have to suspend and tell the sbios before the platform
microcontroller can turn off the power rail.  Presumably there is
something missing that prevents the microcontroller for powering down
the rail.  If the power rail is kept on, the device is never powered
off and still retains its current state.

> With that in mind, and lacking any better info, wouldn't it make sense
> for amdgpu_device_resume() to always call reset? Maybe it's not
> necessary in the S3 case, but it shouldn't harm anything. Or perhaps
> it could check if the device is alive and reset it if it's not?

It's just papering over the problem.  It would be better from a power
perspective for the driver to just not suspend and keep running like
normal.  When the driver is not suspended runtime things like clock
and power gating are active which keep the GPU power at a minimum.

>
> Alternatively do you have any other contacts within AMD that could
> help us figure out the underlying question of how to correctly suspend
> and resume these devices? Happy to ship an affected product sample
> your way.
>

I talked to our sbios team and they seem to think our S0ix
implementation works pretty differently from Intel's.  I'm not really
an expert on this area however.  We have a new team ramping on up this
for Linux however.

Alex
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2020-01-16 15:14                   ` Alex Deucher
@ 2020-02-07  4:52                     ` Daniel Drake
  -1 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2020-02-07  4:52 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Thu, Jan 16, 2020 at 11:15 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> It's just papering over the problem.  It would be better from a power
> perspective for the driver to just not suspend and keep running like
> normal.  When the driver is not suspended runtime things like clock
> and power gating are active which keep the GPU power at a minimum.

Until we have a better solution, are there any strategies we could
apply here to avoid the suspend as you say?
e.g. DMI quirk these products to disable suspend? Or disable suspend
on all s2idle setups?

This would certainly be better than the current situation of the
machine becoming unusable on resume.

> I talked to our sbios team and they seem to think our S0ix
> implementation works pretty differently from Intel's.  I'm not really
> an expert on this area however.  We have a new team ramping on up this
> for Linux however.

Thanks for following up on this internally! Can I lend a product
sample to the new team so that they have direct access?

Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2020-02-07  4:52                     ` Daniel Drake
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Drake @ 2020-02-07  4:52 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Thu, Jan 16, 2020 at 11:15 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> It's just papering over the problem.  It would be better from a power
> perspective for the driver to just not suspend and keep running like
> normal.  When the driver is not suspended runtime things like clock
> and power gating are active which keep the GPU power at a minimum.

Until we have a better solution, are there any strategies we could
apply here to avoid the suspend as you say?
e.g. DMI quirk these products to disable suspend? Or disable suspend
on all s2idle setups?

This would certainly be better than the current situation of the
machine becoming unusable on resume.

> I talked to our sbios team and they seem to think our S0ix
> implementation works pretty differently from Intel's.  I'm not really
> an expert on this area however.  We have a new team ramping on up this
> for Linux however.

Thanks for following up on this internally! Can I lend a product
sample to the new team so that they have direct access?

Daniel
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
  2020-01-16 15:14                   ` Alex Deucher
@ 2020-10-23 15:04                     ` Alex Deucher
  -1 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2020-10-23 15:04 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Christian Koenig, Chunming Zhou,
	amd-gfx list, Linux PM

On Thu, Jan 16, 2020 at 10:14 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Jan 15, 2020 at 2:44 AM Daniel Drake <drake@endlessm.com> wrote:
> >
> > On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > > I think there may be some AMD specific handling needed in
> > > drivers/acpi/sleep.c.  My understanding from reading the modern
> > > standby documents from MS is that each vendor needs to provide a
> > > platform specific PEP driver.  I'm not sure how much of that current
> > > code is Intel specific or not.
> >
> > I don't think there is anything Intel-specific in drivers/acpi/sleep.c.
> >
> > Reading more about PEP, I see that Linux supports PEP devices with
> > ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
> > have INT33A1 devices in their ACPI tables.
> >
> > This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
> > PNP0D80. Full acpidump:
> > https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
> > ssdt7)
> >
> > This PEP device responds to a _DSM with UUID argument
> > "e3f32452-febc-43ce-9039-932122d37721", which is not the one
> > documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
> >
> > Nevertheless, there is some data about the GPU:
> >                     Package (0x04)
> >                     {
> >                         One,
> >                         "\\_SB.PCI0.GP17.VGA",
> >                         Zero,
> >                         0x03
> >                     },
> >
> > However since this data is identical to many other devices that
> > suspend and resume just fine, I wonder if it is really important.
> >
> > The one supported method does offer two calls which may mirror the
> > Display Off/On Notifications in the above spec:
> >                         Case (0x02)
> >                         {
> >                             \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
> >                             Return (Zero)
> >                         }
> >                         Case (0x03)
> >                         {
> >                             \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
> >                             Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
> > Status Change
> >                             Return (Zero)
> >                         }
> >
> > but I tried executing this code after suspending amdgpu, and the
> > problem still stands, amdgpu cannot wakeup correctly.
> >
> > There's nothing else really interesting in the PEP device as far as I can see.
> >
> > PEP things aside, I am still quite suspicious about the fact that
> > calling amdgpu_device_suspend() then amdgpu_device_resume() on
> > multiple products (not just this one) fails. It seems that this code
> > flow is relying on the BIOS doing something in the S3 suspend/resume
> > path in order to make the device resumable by amdgpu_device_resume(),
> > which is why we have only encountered this issue for the first time on
> > our first AMD platform that does not support S3 suspend.
> >
>
> It makes sense.  This is an SOC, not a collection of individual
> devices.  There are more devices than power rails so the sbios (via
> ACPI) has to handle the power rail.  All of the devices using that
> power rail have to suspend and tell the sbios before the platform
> microcontroller can turn off the power rail.  Presumably there is
> something missing that prevents the microcontroller for powering down
> the rail.  If the power rail is kept on, the device is never powered
> off and still retains its current state.
>
> > With that in mind, and lacking any better info, wouldn't it make sense
> > for amdgpu_device_resume() to always call reset? Maybe it's not
> > necessary in the S3 case, but it shouldn't harm anything. Or perhaps
> > it could check if the device is alive and reset it if it's not?
>
> It's just papering over the problem.  It would be better from a power
> perspective for the driver to just not suspend and keep running like
> normal.  When the driver is not suspended runtime things like clock
> and power gating are active which keep the GPU power at a minimum.
>
> >
> > Alternatively do you have any other contacts within AMD that could
> > help us figure out the underlying question of how to correctly suspend
> > and resume these devices? Happy to ship an affected product sample
> > your way.
> >
>
> I talked to our sbios team and they seem to think our S0ix
> implementation works pretty differently from Intel's.  I'm not really
> an expert on this area however.  We have a new team ramping on up this
> for Linux however.

Initial patches for S0ix platform support:
https://patchwork.kernel.org/project/platform-driver-x86/patch/20201023080410.458629-1-Shyam-sundar.S-k@amd.com/
https://patchwork.kernel.org/project/linux-acpi/patch/20201023080315.458570-1-Shyam-sundar.S-k@amd.com/

You also need the following GPU driver patches:
https://patchwork.freedesktop.org/series/82762/

There is also an audio patch required which will be available soon.

Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] drm/amdgpu: always reset asic when going into suspend
@ 2020-10-23 15:04                     ` Alex Deucher
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Deucher @ 2020-10-23 15:04 UTC (permalink / raw)
  To: Daniel Drake
  Cc: Deucher, Alexander, Chunming Zhou, Christian Koenig,
	amd-gfx list, Linux PM

On Thu, Jan 16, 2020 at 10:14 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Jan 15, 2020 at 2:44 AM Daniel Drake <drake@endlessm.com> wrote:
> >
> > On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > > I think there may be some AMD specific handling needed in
> > > drivers/acpi/sleep.c.  My understanding from reading the modern
> > > standby documents from MS is that each vendor needs to provide a
> > > platform specific PEP driver.  I'm not sure how much of that current
> > > code is Intel specific or not.
> >
> > I don't think there is anything Intel-specific in drivers/acpi/sleep.c.
> >
> > Reading more about PEP, I see that Linux supports PEP devices with
> > ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
> > have INT33A1 devices in their ACPI tables.
> >
> > This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
> > PNP0D80. Full acpidump:
> > https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
> > ssdt7)
> >
> > This PEP device responds to a _DSM with UUID argument
> > "e3f32452-febc-43ce-9039-932122d37721", which is not the one
> > documented at https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
> >
> > Nevertheless, there is some data about the GPU:
> >                     Package (0x04)
> >                     {
> >                         One,
> >                         "\\_SB.PCI0.GP17.VGA",
> >                         Zero,
> >                         0x03
> >                     },
> >
> > However since this data is identical to many other devices that
> > suspend and resume just fine, I wonder if it is really important.
> >
> > The one supported method does offer two calls which may mirror the
> > Display Off/On Notifications in the above spec:
> >                         Case (0x02)
> >                         {
> >                             \_SB.PCI0.SBRG.EC0.CSEE (0xB7)
> >                             Return (Zero)
> >                         }
> >                         Case (0x03)
> >                         {
> >                             \_SB.PCI0.SBRG.EC0.CSEE (0xB8)
> >                             Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) //
> > Status Change
> >                             Return (Zero)
> >                         }
> >
> > but I tried executing this code after suspending amdgpu, and the
> > problem still stands, amdgpu cannot wakeup correctly.
> >
> > There's nothing else really interesting in the PEP device as far as I can see.
> >
> > PEP things aside, I am still quite suspicious about the fact that
> > calling amdgpu_device_suspend() then amdgpu_device_resume() on
> > multiple products (not just this one) fails. It seems that this code
> > flow is relying on the BIOS doing something in the S3 suspend/resume
> > path in order to make the device resumable by amdgpu_device_resume(),
> > which is why we have only encountered this issue for the first time on
> > our first AMD platform that does not support S3 suspend.
> >
>
> It makes sense.  This is an SOC, not a collection of individual
> devices.  There are more devices than power rails so the sbios (via
> ACPI) has to handle the power rail.  All of the devices using that
> power rail have to suspend and tell the sbios before the platform
> microcontroller can turn off the power rail.  Presumably there is
> something missing that prevents the microcontroller for powering down
> the rail.  If the power rail is kept on, the device is never powered
> off and still retains its current state.
>
> > With that in mind, and lacking any better info, wouldn't it make sense
> > for amdgpu_device_resume() to always call reset? Maybe it's not
> > necessary in the S3 case, but it shouldn't harm anything. Or perhaps
> > it could check if the device is alive and reset it if it's not?
>
> It's just papering over the problem.  It would be better from a power
> perspective for the driver to just not suspend and keep running like
> normal.  When the driver is not suspended runtime things like clock
> and power gating are active which keep the GPU power at a minimum.
>
> >
> > Alternatively do you have any other contacts within AMD that could
> > help us figure out the underlying question of how to correctly suspend
> > and resume these devices? Happy to ship an affected product sample
> > your way.
> >
>
> I talked to our sbios team and they seem to think our S0ix
> implementation works pretty differently from Intel's.  I'm not really
> an expert on this area however.  We have a new team ramping on up this
> for Linux however.

Initial patches for S0ix platform support:
https://patchwork.kernel.org/project/platform-driver-x86/patch/20201023080410.458629-1-Shyam-sundar.S-k@amd.com/
https://patchwork.kernel.org/project/linux-acpi/patch/20201023080315.458570-1-Shyam-sundar.S-k@amd.com/

You also need the following GPU driver patches:
https://patchwork.freedesktop.org/series/82762/

There is also an audio patch required which will be available soon.

Alex
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-10-23 15:05 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-15  6:50 [PATCH] drm/amdgpu: always reset asic when going into suspend Daniel Drake
     [not found] ` <20191015065002.18701-1-drake-6IF/jdPJHihWk0Htik3J/w@public.gmane.org>
2019-10-15 18:42   ` Alex Deucher
2019-10-16  7:23     ` Daniel Drake
2019-10-16  7:23       ` Daniel Drake
2019-11-22 15:31       ` Alex Deucher
2019-11-22 15:31         ` Alex Deucher
2019-11-22 15:31         ` Alex Deucher
2019-11-25  5:17         ` Daniel Drake
2019-11-25  5:17           ` Daniel Drake
2019-11-25  5:17           ` Daniel Drake
2019-12-16  9:00           ` Daniel Drake
2019-12-16  9:00             ` Daniel Drake
2019-12-19 14:08             ` Alex Deucher
2019-12-19 14:08               ` Alex Deucher
2020-01-15  7:44               ` Daniel Drake
2020-01-15  7:44                 ` Daniel Drake
2020-01-16 15:14                 ` Alex Deucher
2020-01-16 15:14                   ` Alex Deucher
2020-02-07  4:52                   ` Daniel Drake
2020-02-07  4:52                     ` Daniel Drake
2020-10-23 15:04                   ` Alex Deucher
2020-10-23 15:04                     ` Alex Deucher

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.