[PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl

dri-devel.lists.freedesktop.org archive mirror

* [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl
@ 2021-08-11 16:52 Michel Dänzer
  2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-11 16:52 UTC (permalink / raw)
  To: Alex Deucher, Christian König; +Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

From: Michel Dänzer <mdaenzer@redhat.com>

In contrast to schedule_delayed_work, this pushes back the work if it
was already scheduled before. Specific behaviour change:

Before:

amdgpu_device_delay_enable_gfx_off ran ~100 ms after the first time
GFXOFF was disabled and re-enabled, even if GFXOFF was disabled and
re-enabled again during those 100 ms.

After:

amdgpu_device_delay_enable_gfx_off runs ~100 ms after the last time
GFXOFF is disabled and re-enabled.

The former resulted in frame drops / stutter with the upcoming mutter
41 release on Navi 14, due to constantly enabling GFXOFF in the HW and
disabling it again (for getting the GPU clock counter).

Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..9cfef56b2aee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -569,7 +569,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 		adev->gfx.gfx_off_req_count--;
 
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
-		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
+		mod_delayed_work(system_wq, &adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
 		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
-- 
2.32.0
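
The behaviour change described in the commit message above can be modelled in plain userspace C. The following is only a sketch, not kernel code: the struct and the gfxoff_model_* helpers are hypothetical stand-ins for struct delayed_work, schedule_delayed_work and mod_delayed_work, times are plain milliseconds, and the 100 ms delay is taken from the "~100 ms" in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct delayed_work; times in ms. */
struct gfxoff_model {
	bool pending;           /* timer armed, work has not run yet */
	unsigned long expires;  /* absolute time at which the work would run */
};

/* Models schedule_delayed_work(): does nothing if already pending. */
static bool gfxoff_model_schedule(struct gfxoff_model *w,
				  unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always re-arms; returns whether it was pending. */
static bool gfxoff_model_mod(struct gfxoff_model *w,
			     unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}

/*
 * Replays "GFXOFF re-enabled" events at times t[0..n-1] (ascending) and
 * returns the time at which the delayed work would last be due.
 */
static unsigned long gfxoff_model_due(const unsigned long *t, size_t n,
				      unsigned long delay, bool use_mod)
{
	struct gfxoff_model w = { false, 0 };
	size_t i;

	assert(t != NULL && n > 0);
	for (i = 0; i < n; i++) {
		/* The work has already run if its deadline has passed. */
		if (w.pending && t[i] >= w.expires)
			w.pending = false;
		if (use_mod)
			gfxoff_model_mod(&w, t[i], delay);
		else
			gfxoff_model_schedule(&w, t[i], delay);
	}
	return w.expires;
}

/* Three re-enables at 0, 30 and 60 ms with a 100 ms delay. */
static unsigned long gfxoff_model_example(bool use_mod)
{
	static const unsigned long t[] = { 0, 30, 60 };

	return gfxoff_model_due(t, sizeof(t) / sizeof(t[0]), 100, use_mod);
}
```

In this model, gfxoff_model_example(false) is due at 100 ms (measured from the first re-enable), while gfxoff_model_example(true) is due at 160 ms (measured from the last one), which matches the "Before" and "After" cases in the description.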



* [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 16:52 [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Michel Dänzer
@ 2021-08-11 16:52 ` Michel Dänzer
  2021-08-11 20:34   ` Alex Deucher
  2021-08-11 21:34   ` AW: " Koenig, Christian
  2021-08-12  2:43 ` [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Quan, Evan
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
  2 siblings, 2 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-11 16:52 UTC (permalink / raw)
  To: Alex Deucher, Christian König; +Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

From: Michel Dänzer <mdaenzer@redhat.com>

In contrast to schedule_delayed_work, this pushes back the work if it
was already scheduled before. Specific behaviour change:

Before:

The scheduled work ran ~1 second after the first time ring_end_use was
called, even if the ring was used again during that second.

After:

The scheduled work runs ~1 second after the last time ring_end_use is
called.

Inspired by the corresponding change in amdgpu_gfx_off_ctrl. While I
haven't run into specific issues in this case, the new behaviour makes
more sense to me.

Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c    | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
index 8996cb4ed57a..2c0040153f6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
@@ -110,7 +110,7 @@ void amdgpu_jpeg_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_jpeg_ring_end_use(struct amdgpu_ring *ring)
 {
 	atomic_dec(&ring->adev->jpeg.total_submission_cnt);
-	schedule_delayed_work(&ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
+	mod_delayed_work(system_wq, &ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
 }
 
 int amdgpu_jpeg_dec_ring_test_ring(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 0f576f294d8a..b6b1d7eeb8e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1283,7 +1283,7 @@ void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
 {
 	if (!amdgpu_sriov_vf(ring->adev))
-		schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
+		mod_delayed_work(system_wq, &ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 1ae7f824adc7..2253c18a6688 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -401,7 +401,7 @@ void amdgpu_vce_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
 {
 	if (!amdgpu_sriov_vf(ring->adev))
-		schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
+		mod_delayed_work(system_wq, &ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index 284bb42d6c86..d5937ab5ac80 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1874,7 +1874,7 @@ void vcn_v1_0_set_pg_for_begin_use(struct amdgpu_ring *ring, bool set_clocks)
 
 void vcn_v1_0_ring_end_use(struct amdgpu_ring *ring)
 {
-	schedule_delayed_work(&ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
+	mod_delayed_work(system_wq, &ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
 	mutex_unlock(&ring->adev->vcn.vcn1_jpeg1_workaround);
 }
 
-- 
2.32.0



* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
@ 2021-08-11 20:34   ` Alex Deucher
  2021-08-11 21:00     ` Zhu, James
  2021-08-11 21:34   ` AW: " Koenig, Christian
  1 sibling, 1 reply; 49+ messages in thread
From: Alex Deucher @ 2021-08-11 20:34 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Alex Deucher, Christian König, Leo Liu, James Zhu,
	amd-gfx list, Mailing list - DRI developers

On Wed, Aug 11, 2021 at 12:52 PM Michel Dänzer <michel@daenzer.net> wrote:
>
> From: Michel Dänzer <mdaenzer@redhat.com>
>
> In contrast to schedule_delayed_work, this pushes back the work if it
> was already scheduled before. Specific behaviour change:
>
> Before:
>
> The scheduled work ran ~1 second after the first time ring_end_use was
> called, even if the ring was used again during that second.
>
> After:
>
> The scheduled work runs ~1 second after the last time ring_end_use is
> called.
>
> Inspired by the corresponding change in amdgpu_gfx_off_ctrl. While I
> haven't run into specific issues in this case, the new behaviour makes
> more sense to me.
>
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>

Makes sense to me.  Applied the series.

Thanks!

Alex




* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 20:34   ` Alex Deucher
@ 2021-08-11 21:00     ` Zhu, James
  0 siblings, 0 replies; 49+ messages in thread
From: Zhu, James @ 2021-08-11 21:00 UTC (permalink / raw)
  To: Alex Deucher, Michel Dänzer
  Cc: Deucher, Alexander, Koenig, Christian, Liu, Leo, amd-gfx list,
	Mailing list - DRI developers


[AMD Official Use Only]

This patch is Reviewed-by: James Zhu <James.Zhu@amd.com>


Thanks & Best Regards!


James Zhu



* AW: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
  2021-08-11 20:34   ` Alex Deucher
@ 2021-08-11 21:34   ` Koenig, Christian
  2021-08-11 22:12     ` Zhu, James
  2021-08-12  2:42     ` Quan, Evan
  1 sibling, 2 replies; 49+ messages in thread
From: Koenig, Christian @ 2021-08-11 21:34 UTC (permalink / raw)
  To: Michel Dänzer, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel


NAK to at least this patch.

Activating power management while work is being submitted is problematic, so cancel_delayed_work() must already have been called during begin_use; otherwise we have a serious coding problem in the first place.

So this change shouldn't make a difference, and I suggest sticking with schedule_delayed_work().

Maybe add a comment explaining how this works?

I need to take a closer look at the first patch when I'm back from vacation, but this could apply there as well.

Regards,
Christian.
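
The argument above can be checked against a small userspace model. This is a sketch under stated assumptions, not driver code: the ring_model_* names are hypothetical, begin_use is assumed to always cancel the pending idle work, begin_use/end_use pairs are assumed to be strictly serialized, and the 1000 ms timeout stands in for the "~1 second" from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one idle_work timer; times in ms. */
struct ring_idle_model {
	bool pending;
	unsigned long expires;
};

/* begin_use: models cancelling the pending idle work before the ring is used. */
static void ring_model_begin_use(struct ring_idle_model *m)
{
	m->pending = false;
}

/* end_use: arms the timer with schedule (use_mod == false) or mod semantics. */
static void ring_model_end_use(struct ring_idle_model *m, unsigned long now,
			       unsigned long timeout, bool use_mod)
{
	if (use_mod || !m->pending) {
		m->pending = true;
		m->expires = now + timeout;
	}
}

/* Replays strictly serialized begin_use/end_use pairs; returns the final deadline. */
static unsigned long ring_model_serialized(const unsigned long *end_times,
					   size_t n, unsigned long timeout,
					   bool use_mod)
{
	struct ring_idle_model m = { false, 0 };
	size_t i;

	assert(end_times != NULL && n > 0);
	for (i = 0; i < n; i++) {
		ring_model_begin_use(&m);
		ring_model_end_use(&m, end_times[i], timeout, use_mod);
	}
	return m.expires;
}

/* Three uses ending at 0, 400 and 900 ms with a 1000 ms idle timeout. */
static unsigned long ring_model_serialized_example(bool use_mod)
{
	static const unsigned long end_times[] = { 0, 400, 900 };

	return ring_model_serialized(end_times, 3, 1000, use_mod);
}
```

Because every end_use is preceded by a cancel, the timer is never pending when it is armed, so both variants give the same deadline (1900 ms here). The model says nothing about concurrent users of the ring.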



* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 21:34   ` AW: " Koenig, Christian
@ 2021-08-11 22:12     ` Zhu, James
  2021-08-11 22:22       ` Zhu, James
  2021-08-12  2:42     ` Quan, Evan
  1 sibling, 1 reply; 49+ messages in thread
From: Zhu, James @ 2021-08-11 22:12 UTC (permalink / raw)
  To: Koenig, Christian, Michel Dänzer, Deucher, Alexander
  Cc: Liu, Leo, amd-gfx, dri-devel


[AMD Official Use Only]

Hi Christian,

Since we have a strict check on queue status, I don't think the original design can cause an issue here.
But this change should help in the following case:

  1.  Both the enc thread and the dec thread try to start begin_use.
  2.  The dec thread gets to finish its begin_use first.
  3.  Before the dec thread enters end_use, the enc thread gets a time slot and runs through begin_use (no delayed work is scheduled at that time).
  4.  The dec thread enters end_use and schedules the delayed work.
  5.  The enc thread enters end_use and modifies that delayed work.

It will save at least one delayed work call.
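
The interleaving in steps 1 to 5 can be replayed in a small userspace model to see where the two variants differ. This is a sketch only: the event type, the idle_model_* names and the timestamps are hypothetical, begin_use is assumed to cancel any pending idle work, and the 1000 ms timeout stands in for the "~1 second" from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ring_event_kind { RING_BEGIN_USE, RING_END_USE };

/* One begin_use or end_use call at a given time (ms). */
struct ring_event {
	enum ring_event_kind kind;
	unsigned long time;
};

/* Hypothetical model of the shared idle_work timer. */
struct idle_timer_model {
	bool pending;
	unsigned long expires;
};

/* Replays the events in order and returns the final idle-work deadline. */
static unsigned long idle_model_replay(const struct ring_event *ev, size_t n,
				       unsigned long timeout, bool use_mod)
{
	struct idle_timer_model m = { false, 0 };
	size_t i;

	assert(ev != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if (ev[i].kind == RING_BEGIN_USE) {
			/* begin_use cancels any pending idle work. */
			m.pending = false;
		} else if (use_mod || !m.pending) {
			/* schedule semantics arm only when idle; mod always re-arms. */
			m.pending = true;
			m.expires = ev[i].time + timeout;
		}
	}
	return m.expires;
}

/* Steps 1-5: both threads begin, then dec ends at 200 ms and enc at 700 ms. */
static unsigned long idle_model_interleaved_example(bool use_mod)
{
	static const struct ring_event ev[] = {
		{ RING_BEGIN_USE,   0 },	/* dec thread */
		{ RING_BEGIN_USE,  10 },	/* enc thread */
		{ RING_END_USE,   200 },	/* dec thread */
		{ RING_END_USE,   700 },	/* enc thread */
	};

	return idle_model_replay(ev, sizeof(ev) / sizeof(ev[0]), 1000, use_mod);
}
```

With schedule semantics the deadline stays at 1200 ms, one timeout after the dec thread's end_use; with mod semantics it moves to 1700 ms, one timeout after the enc thread's end_use. The model only tracks the deadline and does not count how often the idle handler runs.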

Thanks & Best Regards!

James Zhu



* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 22:12     ` Zhu, James
@ 2021-08-11 22:22       ` Zhu, James
  0 siblings, 0 replies; 49+ messages in thread
From: Zhu, James @ 2021-08-11 22:22 UTC (permalink / raw)
  To: Koenig, Christian, Michel Dänzer, Deucher, Alexander
  Cc: Liu, Leo, amd-gfx, dri-devel


[AMD Official Use Only]

I shouldn't have said it reduces one delayed work call. For this case, Michel's proposal is closer to the purpose of the idle work design.

Thanks & Best Regards!

James Zhu


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-11 21:34   ` AW: " Koenig, Christian
  2021-08-11 22:12     ` Zhu, James
@ 2021-08-12  2:42     ` Quan, Evan
  2021-08-12  5:55       ` AW: " Koenig, Christian
  1 sibling, 1 reply; 49+ messages in thread
From: Quan, Evan @ 2021-08-12  2:42 UTC (permalink / raw)
  To: Koenig, Christian, Michel Dänzer, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

[AMD Official Use Only]

Unlike in the 1st patch of the series (for amdgpu_gfx_off_ctrl), here "cancel_delayed_work_sync(&adev->uvd.idle_work)" is called in e.g. amdgpu_uvd_ring_begin_use(). In that case, does this make any difference compared to the previous implementation using "schedule_delayed_work"?
Suppose the sequence is as below:

  *   Ring begin use
  *   Ring end use -->  mod_delayed_work() : queue a new delayed work, right?
  *   Ring begin use (within 1s) --> cancel_delayed_work_sync() will cancel the work submitted above, right?
  *   Ring end use  --> mod_delayed_work(): queue another new scheduled work, same as previous "schedule_delayed_work"?
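
The sequence above can be replayed in a small user-space sketch. This is only a toy model, not kernel code: every toy_* name and the timestamps (10, 500, 510 ms) are made up for illustration, and the 1000 ms timeout is taken from the "~1 second" in the commit message. It assumes the idle work has not started running when it is cancelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct delayed_work: a pending flag plus a deadline in ms. */
struct toy_delayed_work {
	bool pending;
	unsigned long expires;
};

/* Models schedule_delayed_work(): a no-op if the work is already pending. */
static bool toy_schedule_delayed_work(struct toy_delayed_work *w,
				      unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always (re)arms; returns whether it was pending. */
static bool toy_mod_delayed_work(struct toy_delayed_work *w,
				 unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}

/* Models cancel_delayed_work_sync() for work that has not started running. */
static bool toy_cancel_delayed_work(struct toy_delayed_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

#define TOY_IDLE_TIMEOUT_MS 1000UL /* "~1 second" per the commit message */

static void toy_ring_begin_use(struct toy_delayed_work *idle_work)
{
	toy_cancel_delayed_work(idle_work);
}

static void toy_ring_end_use(struct toy_delayed_work *idle_work,
			     unsigned long now, bool use_mod)
{
	if (use_mod)
		toy_mod_delayed_work(idle_work, now, TOY_IDLE_TIMEOUT_MS);
	else
		toy_schedule_delayed_work(idle_work, now, TOY_IDLE_TIMEOUT_MS);
}

/* Replays the four steps above and returns the final idle deadline in ms. */
static unsigned long toy_replay_sequence(bool use_mod)
{
	struct toy_delayed_work idle_work = { .pending = false, .expires = 0 };

	toy_ring_begin_use(&idle_work);             /* t = 0 */
	toy_ring_end_use(&idle_work, 10, use_mod);  /* t = 10: deadline 1010 */
	toy_ring_begin_use(&idle_work);             /* t = 500: cancels the work */
	assert(!idle_work.pending);
	toy_ring_end_use(&idle_work, 510, use_mod); /* t = 510: deadline 1510 */

	return idle_work.expires;
}
```

In this model toy_replay_sequence(false) and toy_replay_sequence(true) return the same deadline, because the cancel in begin use always leaves the work idle before end use queues it again.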

BR
Evan
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Koenig, Christian
Sent: Thursday, August 12, 2021 5:34 AM
To: Michel Dänzer <michel@daenzer.net>; Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
Subject: AW: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks

NAK to at least this patch.

Since activating power management while submitting work is problematic, cancel_delayed_work() must have been called during begin use; otherwise we have a serious coding problem in the first place.

So this change shouldn't make a difference, and I suggest sticking with schedule_delayed_work().

Maybe add a comment explaining how this works?

Need to take a closer look at the first patch when I'm back from vacation, but it could be that this applies there as well.

Regards,
Christian.

________________________________
From: Michel Dänzer <michel@daenzer.net>
Sent: Wednesday, August 11, 2021 6:52 PM
To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
Subject: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks

From: Michel Dänzer <mdaenzer@redhat.com>

In contrast to schedule_delayed_work, this pushes back the work if it
was already scheduled before. Specific behaviour change:

Before:

The scheduled work ran ~1 second after the first time ring_end_use was
called, even if the ring was used again during that second.

After:

The scheduled work runs ~1 second after the last time ring_end_use is
called.

Inspired by the corresponding change in amdgpu_gfx_off_ctrl. While I
haven't run into specific issues in this case, the new behaviour makes
more sense to me.

Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c    | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
index 8996cb4ed57a..2c0040153f6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
@@ -110,7 +110,7 @@ void amdgpu_jpeg_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_jpeg_ring_end_use(struct amdgpu_ring *ring)
 {
         atomic_dec(&ring->adev->jpeg.total_submission_cnt);
-       schedule_delayed_work(&ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
+       mod_delayed_work(system_wq, &ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
 }

 int amdgpu_jpeg_dec_ring_test_ring(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 0f576f294d8a..b6b1d7eeb8e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1283,7 +1283,7 @@ void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
 {
         if (!amdgpu_sriov_vf(ring->adev))
-               schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
+               mod_delayed_work(system_wq, &ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
 }

 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 1ae7f824adc7..2253c18a6688 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -401,7 +401,7 @@ void amdgpu_vce_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
 {
         if (!amdgpu_sriov_vf(ring->adev))
-               schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
+               mod_delayed_work(system_wq, &ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
 }

 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index 284bb42d6c86..d5937ab5ac80 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1874,7 +1874,7 @@ void vcn_v1_0_set_pg_for_begin_use(struct amdgpu_ring *ring, bool set_clocks)

 void vcn_v1_0_ring_end_use(struct amdgpu_ring *ring)
 {
-       schedule_delayed_work(&ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
+       mod_delayed_work(system_wq, &ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
         mutex_unlock(&ring->adev->vcn.vcn1_jpeg1_workaround);
 }

--
2.32.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl
  2021-08-11 16:52 [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Michel Dänzer
  2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
@ 2021-08-12  2:43 ` Quan, Evan
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
  2 siblings, 0 replies; 49+ messages in thread
From: Quan, Evan @ 2021-08-12  2:43 UTC (permalink / raw)
  To: Michel Dänzer, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

[AMD Official Use Only]

Reviewed-by: Evan Quan <evan.quan@amd.com>

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Michel Dänzer
> Sent: Thursday, August 12, 2021 12:52 AM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-
> gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in
> amdgpu_gfx_off_ctrl
> 
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> In contrast to schedule_delayed_work, this pushes back the work if it
> was already scheduled before. Specific behaviour change:
> 
> Before:
> 
> amdgpu_device_delay_enable_gfx_off ran ~100 ms after the first time
> GFXOFF was disabled and re-enabled, even if GFXOFF was disabled and
> re-enabled again during those 100 ms.
> 
> After:
> 
> amdgpu_device_delay_enable_gfx_off runs ~100 ms after the last time
> GFXOFF is disabled and re-enabled.
> 
> The former resulted in frame drops / stutter with the upcoming mutter
> 41 release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..9cfef56b2aee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -569,7 +569,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device
> *adev, bool enable)
>  		adev->gfx.gfx_off_req_count--;
> 
>  	if (enable && !adev->gfx.gfx_off_state && !adev-
> >gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
> GFX_OFF_DELAY_ENABLE);
> +		mod_delayed_work(system_wq, &adev-
> >gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>  	} else if (!enable && adev->gfx.gfx_off_state) {
>  		if (!amdgpu_dpm_set_powergating_by_smu(adev,
> AMD_IP_BLOCK_TYPE_GFX, false)) {
>  			adev->gfx.gfx_off_state = false;
> --
> 2.32.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* AW: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12  2:42     ` Quan, Evan
@ 2021-08-12  5:55       ` Koenig, Christian
  2021-08-12  8:11         ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Koenig, Christian @ 2021-08-12  5:55 UTC (permalink / raw)
  To: Quan, Evan, Michel Dänzer, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

Hi James,

Evan seems to have understood how this all works together.

See, while any begin/end use critical section is active, the work should not be active.

When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items from being started.

Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.

Something similar applies to the first patch I think, so when this makes a difference it is actually a bug.

Regards,
Christian.

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12  5:55       ` AW: " Koenig, Christian
@ 2021-08-12  8:11         ` Michel Dänzer
  2021-08-12 11:33           ` Lazar, Lijo
  2021-08-16  7:33           ` Christian König
  0 siblings, 2 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-12  8:11 UTC (permalink / raw)
  To: Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
> Hi James,
> 
> Evan seems to have understood how this all works together.
> 
> See while any begin/end use critical section is active the work should not be active.
> 
> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
> 
> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.

It merely assumes that the work may already have been scheduled before.

Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.

So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
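
The difference between the two calls only shows up when the work is still pending, which is the patch 1 situation where nothing cancels it. A minimal user-space sketch of that, assuming a toy model of a delayed work item (the toy_* names and the request times 0 and 50 ms are hypothetical; the 100 ms delay is the "~100 ms" from the patch 1 commit log):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct delayed_work: a pending flag plus a deadline in ms. */
struct toy_delayed_work {
	bool pending;
	unsigned long expires;
};

/* Models schedule_delayed_work(): keeps the old deadline if already pending. */
static bool toy_schedule_delayed_work(struct toy_delayed_work *w,
				      unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always (re)arms; returns whether it was pending. */
static bool toy_mod_delayed_work(struct toy_delayed_work *w,
				 unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}

#define TOY_GFX_OFF_DELAY_MS 100UL /* "~100 ms" per the patch 1 commit log */

/*
 * Two "re-enable GFXOFF" requests at t = 0 and t = 50 with no cancel in
 * between. Returns the deadline (in ms) at which the work would run.
 */
static unsigned long toy_deadline_after_two_requests(bool use_mod)
{
	struct toy_delayed_work work = { .pending = false, .expires = 0 };
	bool first;

	if (use_mod) {
		/* Idle work: mod behaves like schedule and reports "was idle". */
		first = !toy_mod_delayed_work(&work, 0, TOY_GFX_OFF_DELAY_MS);
		toy_mod_delayed_work(&work, 50, TOY_GFX_OFF_DELAY_MS);
	} else {
		first = toy_schedule_delayed_work(&work, 0, TOY_GFX_OFF_DELAY_MS);
		toy_schedule_delayed_work(&work, 50, TOY_GFX_OFF_DELAY_MS);
	}
	assert(first); /* the first request always queues the idle work */

	return work.expires;
}
```

With the schedule variant the deadline stays 100 ms after the first request; with the mod variant it moves to 100 ms after the last one.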

> Something similar applies to the first patch I think,

There are no cancel work calls in that case, so the commit log is accurate TTBOMK.

I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 200-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.

> so when this makes a difference it is actually a bug.

There was certainly a bug though, which patch 1 fixes. :)

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12  8:11         ` Michel Dänzer
@ 2021-08-12 11:33           ` Lazar, Lijo
  2021-08-12 16:54             ` Michel Dänzer
  2021-08-16  7:33           ` Christian König
  1 sibling, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-12 11:33 UTC (permalink / raw)
  To: Michel Dänzer, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel



On 8/12/2021 1:41 PM, Michel Dänzer wrote:
> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>> Hi James,
>>
>> Evan seems to have understood how this all works together.
>>
>> See while any begin/end use critical section is active the work should not be active.
>>
>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
>>
>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
> 
> It merely assumes that the work may already have been scheduled before.
> 
> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
> 
> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
> 
> 
>> Something similar applies to the first patch I think,
> 
> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.

Curious -

For patch 1, does it make a difference if any delayed work scheduled is 
cancelled in the else part before proceeding?

	} else if (!enable && adev->gfx.gfx_off_state) {
		cancel_delayed_work(&adev->gfx.gfx_off_delay_work);


Thanks,
Lijo

> 
> I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 200-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.
> 
> 
>> so when this makes a difference it is actually a bug.
> 
> There was certainly a bug though, which patch 1 fixes. :)
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12 11:33           ` Lazar, Lijo
@ 2021-08-12 16:54             ` Michel Dänzer
  2021-08-13  4:23               ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-12 16:54 UTC (permalink / raw)
  To: Lazar, Lijo, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>> Hi James,
>>>
>>> Evan seems to have understood how this all works together.
>>>
>>> See while any begin/end use critical section is active the work should not be active.
>>>
>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
>>>
>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>
>> It merely assumes that the work may already have been scheduled before.
>>
>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>
>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>
>>
>>> Something similar applies to the first patch I think,
>>
>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
> 
> Curious -
> 
> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
> 
> } else if (!enable && adev->gfx.gfx_off_state) {
> cancel_delayed_work();

I tried the patch below.

While this does seem to fix the problem as well, I see a potential issue:

1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync

I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)


Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..3e4585ffb9af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -570,8 +570,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 
        if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
                schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-       } else if (!enable && adev->gfx.gfx_off_state) {
-               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+       } else if (!enable) {
+               cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+               if (adev->gfx.gfx_off_state &&
+                   !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
                        adev->gfx.gfx_off_state = false;
 
                        if (adev->gfx.funcs->init_spm_golden) {

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12 16:54             ` Michel Dänzer
@ 2021-08-13  4:23               ` Lazar, Lijo
  2021-08-13 10:31                 ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-13  4:23 UTC (permalink / raw)
  To: Michel Dänzer, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel



On 8/12/2021 10:24 PM, Michel Dänzer wrote:
> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>> Hi James,
>>>>
>>>> Evan seems to have understood how this all works together.
>>>>
>>>> See while any begin/end use critical section is active the work should not be active.
>>>>
>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
>>>>
>>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>
>>> It merely assumes that the work may already have been scheduled before.
>>>
>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>
>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>
>>>
>>>> Something similar applies to the first patch I think,
>>>
>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>
>> Curious -
>>
>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>
>> } else if (!enable && adev->gfx.gfx_off_state) {
>> cancel_delayed_work();
> 
> I tried the patch below.
> 
> While this does seem to fix the problem as well, I see a potential issue:
> 
> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
> 
> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
> 

We should use cancel_delayed_work instead of the _sync version. As you
mentioned, at best the work has not started yet and is cancelled
successfully; at worst it is waiting for the mutex. In the worst case,
if amdgpu_device_delay_enable_gfx_off gets the mutex after
amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.

if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)

The count wouldn't be 0 and hence it won't enable GFXOFF.
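
A minimal user-space sketch of that re-check, assuming a toy model of the state involved (struct toy_gfx and the toy_* functions are hypothetical; only the condition itself is taken from the check quoted above, and the real handler additionally calls into the SMU before setting the state):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two adev->gfx fields used by the check above. */
struct toy_gfx {
	bool gfx_off_state;             /* true once GFXOFF is enabled in HW */
	unsigned int gfx_off_req_count; /* outstanding "disable GFXOFF" requests */
};

/*
 * Models the body of the delayed work once it holds the mutex.
 * Returns true if it went ahead and enabled GFXOFF.
 */
static bool toy_delay_enable_gfx_off(struct toy_gfx *gfx)
{
	assert(gfx);

	if (!gfx->gfx_off_state && !gfx->gfx_off_req_count) {
		gfx->gfx_off_state = true;
		return true;
	}
	return false;
}

/*
 * Worst case from the discussion: the work was already waiting for the
 * mutex and runs right after amdgpu_gfx_off_ctrl(adev, false) unlocked it.
 * req_count is the request count left behind by that call.
 */
static bool toy_late_work_enables_gfxoff(unsigned int req_count)
{
	struct toy_gfx gfx = {
		.gfx_off_state = false,
		.gfx_off_req_count = req_count,
	};

	return toy_delay_enable_gfx_off(&gfx);
}
```

In this model toy_late_work_enables_gfxoff(1) returns false: a work item that slips through after a disable request sees the non-zero count and leaves GFXOFF alone. Only with a count of 0 does it enable GFXOFF.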

> 
> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)

As mentioned earlier, cancel_delayed_work won't cause this issue.

In the mod_delayed_work patch, the mod_ version is called only when
req_count is 0. While that is a good thing, it keeps alive one more
contender for the mutex.

The cancel_ version eliminates that contender if it happens to be called
at the right time (more likely if there are multiple requests to disable
gfxoff). On the other hand, I don't know how costly it is to call cancel_
every time in the else part (or maybe call it only once, when the count
increments to 1?).

Thanks,
Lijo

> 
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..3e4585ffb9af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -570,8 +570,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
> 
>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>                  schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -       } else if (!enable && adev->gfx.gfx_off_state) {
> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +       } else if (!enable) {
> +               cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +               if (adev->gfx.gfx_off_state &&
> +                   !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>                          adev->gfx.gfx_off_state = false;
> 
>                          if (adev->gfx.funcs->init_spm_golden) {
> 


* [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-11 16:52 [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Michel Dänzer
  2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
  2021-08-12  2:43 ` [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Quan, Evan
@ 2021-08-13 10:29 ` Michel Dänzer
  2021-08-13 11:50   ` Lazar, Lijo
                     ` (3 more replies)
  2 siblings, 4 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-13 10:29 UTC (permalink / raw)
  To: Alex Deucher, Christian König; +Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

From: Michel Dänzer <mdaenzer@redhat.com>

schedule_delayed_work does not push back the work if it was already
scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
was disabled and re-enabled again during those 100 ms.

This resulted in frame drops / stutter with the upcoming mutter 41
release on Navi 14, due to constantly enabling GFXOFF in the HW and
disabling it again (for getting the GPU clock counter).

To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
enabled to disabled. This makes sure the delayed work will be scheduled
as intended in the reverse case.

In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
to use mutex_trylock instead of mutex_lock.

v2:
* Use cancel_delayed_work_sync & mutex_trylock instead of
  mod_delayed_work.

Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f3fd5ec710b6..8b025f70706c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
 	struct amdgpu_device *adev =
 		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
 
-	mutex_lock(&adev->gfx.gfx_off_mutex);
+	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
+	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
+		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
+		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
+		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
+		 */
+		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
+		return;
+	}
+
 	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
 			adev->gfx.gfx_off_state = true;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..da4c46db3093 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -28,9 +28,6 @@
 #include "amdgpu_rlc.h"
 #include "amdgpu_ras.h"
 
-/* delay 0.1 second to enable gfx off feature */
-#define GFX_OFF_DELAY_ENABLE         msecs_to_jiffies(100)
-
 /*
  * GPU GFX IP block helpers function.
  */
@@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 		adev->gfx.gfx_off_req_count--;
 
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
-		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
+	} else if (!enable) {
+		if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
+			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+		if (adev->gfx.gfx_off_state &&
+		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
 
 			if (adev->gfx.funcs->init_spm_golden) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index d43fe2ed8116..dcdb505bb7f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -32,6 +32,9 @@
 #include "amdgpu_rlc.h"
 #include "soc15.h"
 
+/* delay 0.1 second to enable gfx off feature */
+#define AMDGPU_GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
+
 /* GFX current status */
 #define AMDGPU_GFX_NORMAL_MODE			0x00000000L
 #define AMDGPU_GFX_SAFE_MODE			0x00000001L
-- 
2.32.0



* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-13  4:23               ` Lazar, Lijo
@ 2021-08-13 10:31                 ` Michel Dänzer
  2021-08-13 11:18                   ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-13 10:31 UTC (permalink / raw)
  To: Lazar, Lijo, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/12/2021 10:24 PM, Michel Dänzer wrote:
>> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>>> Hi James,
>>>>>
>>>>> Evan seems to have understood how this all works together.
>>>>>
>>>>> See while any begin/end use critical section is active the work should not be active.
>>>>>
>>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
>>>>>
>>>>> Michelle's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>>
>>>> It merely assumes that the work may already have been scheduled before.
>>>>
>>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>>
>>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>>
>>>>
>>>>> Something similar applies to the first patch I think,
>>>>
>>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>>
>>> Curious -
>>>
>>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>>
>>> } else if (!enable && adev->gfx.gfx_off_state) {
>>> cancel_delayed_work();
>>
>> I tried the patch below.
>>
>> While this does seem to fix the problem as well, I see a potential issue:
>>
>> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
>> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
>> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
>>
>> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
> 
> Should use the cancel_delayed_work instead of the _sync version.

The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock.

> As you mentioned - at best work is not scheduled yet and cancelled successfully, or at worst it's waiting for the mutex. In the worst case, if amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
> 
> if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
> 
> The count wouldn't be 0 and hence it won't enable GFXOFF.

I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0.


>> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
> 
> As mentioned earlier, cancel_delayed_work won't cause this issue.
> 
> In the mod_delayed_ patch, mod_ version is called only when req_count is 0. While that is a good thing, it keeps alive one more contender for the mutex.

Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well.

> The cancel_ version eliminates that contender if happens to be called at the right time (more likely if there are multiple requests to disable gfxoff). On the other hand, don't know how costly it is to call cancel_ every time on the else part (or maybe call only once when count increments to 1?).

Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far.


I sent out a v2 patch which should address all these issues.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-13 10:31                 ` Michel Dänzer
@ 2021-08-13 11:18                   ` Lazar, Lijo
  0 siblings, 0 replies; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-13 11:18 UTC (permalink / raw)
  To: Michel Dänzer, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel



On 8/13/2021 4:01 PM, Michel Dänzer wrote:
> On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
>>
>>
>> On 8/12/2021 10:24 PM, Michel Dänzer wrote:
>>> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>>>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>>>> Hi James,
>>>>>>
>>>>>> Evan seems to have understood how this all works together.
>>>>>>
>>>>>> See while any begin/end use critical section is active the work should not be active.
>>>>>>
>>>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items to be started.
>>>>>>
>>>>>> Michelle's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>>>
>>>>> It merely assumes that the work may already have been scheduled before.
>>>>>
>>>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>>>
>>>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>>>
>>>>>
>>>>>> Something similar applies to the first patch I think,
>>>>>
>>>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>>>
>>>> Curious -
>>>>
>>>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>>>
>>>> } else if (!enable && adev->gfx.gfx_off_state) {
>>>> cancel_delayed_work();
>>>
>>> I tried the patch below.
>>>
>>> While this does seem to fix the problem as well, I see a potential issue:
>>>
>>> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
>>> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
>>> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
>>>
>>> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
>>
>> Should use the cancel_delayed_work instead of the _sync version.
> 
> The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock.

From what I understood from the description, cancel initiates a 
cancel. If the work has already started, it returns false, saying it 
couldn't succeed; otherwise it cancels the scheduled work and returns 
true. The note below specifically asks to use the _sync version 
if we need to wait for work that has already started, and that definitely 
has the deadlock problem you mentioned above.

  * Note:
  * The work callback function may still be running on return, unless
  * it returns %true and the work doesn't re-arm itself.  Explicitly flush or
  * use cancel_delayed_work_sync() to wait on it.


> 
>> As you mentioned - at best work is not scheduled yet and cancelled successfully, or at worst it's waiting for the mutex. In the worst case, if amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
>>
>> if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
>>
>> The count wouldn't be 0 and hence it won't enable GFXOFF.
> 
> I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0.
> 

Yes, this is something we can't avoid with either approach. If the work has 
already started, then mod_delayed_ doesn't have any impact either. Another 
case is that the work thread has already got the mutex and a disable request 
comes in just at that time. The request needs to wait until the mutex is 
released by the work, which could mean gfxoff is enabled and then 
immediately disabled again.

> 
>>> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
>>
>> As mentioned earlier, cancel_delayed_work won't cause this issue.
>>
>> In the mod_delayed_ patch, mod_ version is called only when req_count is 0. While that is a good thing, it keeps alive one more contender for the mutex.
> 
> Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well.
> 

The mod_delayed is called only when req_count gets back to 0. If 
another disable request comes after that, it neither cancels the 
scheduled work nor adjusts the delay.

Ex:
Disable gfxoff -> Enable gfxoff (now the work is scheduled) -> Disable 
gfxoff (within 5ms or whatever the delay is, but this call won't go to 
the mod_delayed path to delay it further) -> Work starts after 5ms and 
creates contention for the mutex -> Enable gfxoff

When cancel_ is used, the second disable call immediately cancels out 
any work that is scheduled but not started and it doesn't create an 
unnecessary contention for the mutex. It's a matter of who gets the 
mutex first. Cancel has a better chance to eliminate the second thread 
possibility.
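
To make the sequence above concrete, here is a minimal userspace sketch (not
driver code). It assumes a simplified model of the request counting in
amdgpu_gfx_off_ctrl; struct gfx_off_model, model_gfx_off_ctrl, model_work_run
and the cancel_on_disable switch are hypothetical names made up for
illustration, and neither the mutex nor the timer is modelled.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not the driver's data structures. */
struct gfx_off_model {
	unsigned int req_count;  /* outstanding "disable GFXOFF" requests */
	bool off_state;          /* true once GFXOFF is enabled in the modelled HW */
	bool work_pending;       /* delayed work queued but not yet run */
	bool cancel_on_disable;  /* assumption: selects the cancel_ variant */
};

/* Simplified stand-in for amdgpu_gfx_off_ctrl(). */
void model_gfx_off_ctrl(struct gfx_off_model *m, bool enable)
{
	if (!enable)
		m->req_count++;
	else if (m->req_count > 0)
		m->req_count--;

	if (enable && !m->off_state && !m->req_count) {
		m->work_pending = true;           /* stands in for schedule_delayed_work() */
	} else if (!enable) {
		if (m->cancel_on_disable)
			m->work_pending = false;  /* stands in for cancelling pending work */
		m->off_state = false;             /* leave GFXOFF if it had been entered */
	}
}

/*
 * Simplified stand-in for the delayed work handler. Returns true if the
 * handler ran at all, i.e. if it would have contended for the mutex.
 */
bool model_work_run(struct gfx_off_model *m)
{
	if (!m->work_pending)
		return false;

	m->work_pending = false;
	if (!m->off_state && !m->req_count)
		m->off_state = true;
	return true;
}
```

Running Disable -> Enable -> Disable through this model leaves the work
pending when cancel_on_disable is false, so the handler still runs (and would
take the mutex) although its check then declines to enable GFXOFF. With
cancel_on_disable set, the second Disable removes the pending work and the
handler never runs.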

>> The cancel_ version eliminates that contender if happens to be called at the right time (more likely if there are multiple requests to disable gfxoff). On the other hand, don't know how costly it is to call cancel_ every time on the else part (or maybe call only once when count increments to 1?).
> 
> Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far.
> 
> 
> I sent out a v2 patch which should address all these issues.
> 

Will check that.

Thanks,
Lijo

> 


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
@ 2021-08-13 11:50   ` Lazar, Lijo
  2021-08-13 13:34     ` Michel Dänzer
  2021-08-16  7:38   ` Christian König
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-13 11:50 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/13/2021 3:59 PM, Michel Dänzer wrote:
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
> enabled to disabled. This makes sure the delayed work will be scheduled
> as intended in the reverse case.
> 
> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
> to use mutex_trylock instead of mutex_lock.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>    mod_delayed_work.
> 
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>   3 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..8b025f70706c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>   
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> +	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
> +	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
> +		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
> +		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
> +		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
> +		 */
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +		return;

This is not needed and just creates another thread to contend for the 
mutex. The checks below take care of enabling gfxoff correctly. If it's 
already in gfx_off state, it doesn't do anything. So I don't see why 
this change is needed.

The other problem is that amdgpu_get_gfx_off_status() also uses the same 
mutex. So the work won't know which thread it is contending 
against and blindly creates more work items.

> +	}
> +
>   	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>   		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>   			adev->gfx.gfx_off_state = true;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..da4c46db3093 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -28,9 +28,6 @@
>   #include "amdgpu_rlc.h"
>   #include "amdgpu_ras.h"
>   
> -/* delay 0.1 second to enable gfx off feature */
> -#define GFX_OFF_DELAY_ENABLE         msecs_to_jiffies(100)
> -
>   /*
>    * GPU GFX IP block helpers function.
>    */
> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>   		adev->gfx.gfx_off_req_count--;
>   
>   	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +	} else if (!enable) {
> +		if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
> +			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);

This has the deadlock problem as discussed in the other thread.

Thanks,
Lijo

> +		if (adev->gfx.gfx_off_state &&
> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>   			adev->gfx.gfx_off_state = false;
>   
>   			if (adev->gfx.funcs->init_spm_golden) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index d43fe2ed8116..dcdb505bb7f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -32,6 +32,9 @@
>   #include "amdgpu_rlc.h"
>   #include "soc15.h"
>   
> +/* delay 0.1 second to enable gfx off feature */
> +#define AMDGPU_GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
> +
>   /* GFX current status */
>   #define AMDGPU_GFX_NORMAL_MODE			0x00000000L
>   #define AMDGPU_GFX_SAFE_MODE			0x00000001L
> 


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 11:50   ` Lazar, Lijo
@ 2021-08-13 13:34     ` Michel Dänzer
  2021-08-13 14:14       ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-13 13:34 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
> 
> 
> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>> From: Michel Dänzer <mdaenzer@redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>> enabled to disabled. This makes sure the delayed work will be scheduled
>> as intended in the reverse case.
>>
>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>> to use mutex_trylock instead of mutex_lock.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
>>
>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>   3 files changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f3fd5ec710b6..8b025f70706c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>       struct amdgpu_device *adev =
>>           container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>   -    mutex_lock(&adev->gfx.gfx_off_mutex);
>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>> +         */
>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>> +        return;
> 
> This is not needed and is just creating another thread to contend for mutex.

Still not sure what you mean by that. What other thread?

> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.

mutex_trylock is needed to prevent the deadlock discussed before and below.

schedule_delayed_work is needed due to this scenario hinted at by the comment:

1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails

GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).

(cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
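
The trylock-and-reschedule pattern described above can be sketched in
userspace as follows. This is only an illustration under stated assumptions:
a C11 atomic_flag stands in for gfx_off_mutex, a counter stands in for the
schedule_delayed_work call, and every name here (struct trylock_model,
model_init, model_trylock, model_unlock, model_delay_enable) is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum work_result { WORK_RESCHEDULED, WORK_DONE };

/* Hypothetical userspace model; not the driver's data structures. */
struct trylock_model {
	atomic_flag mutex;          /* stands in for gfx_off_mutex; set == held */
	unsigned int req_count;     /* outstanding "disable GFXOFF" requests */
	bool off_state;             /* true once GFXOFF is enabled in the modelled HW */
	unsigned int resched_count; /* how often the handler re-queued itself */
};

void model_init(struct trylock_model *m)
{
	atomic_flag_clear(&m->mutex);
	m->req_count = 0;
	m->off_state = false;
	m->resched_count = 0;
}

/* Returns true if the lock was taken, false if it was already held. */
bool model_trylock(struct trylock_model *m)
{
	return !atomic_flag_test_and_set(&m->mutex);
}

void model_unlock(struct trylock_model *m)
{
	atomic_flag_clear(&m->mutex);
}

/* Simplified stand-in for the v2 delayed work handler. */
enum work_result model_delay_enable(struct trylock_model *m)
{
	if (!model_trylock(m)) {
		/* Never block here; just try again later. */
		m->resched_count++;  /* stands in for schedule_delayed_work() */
		return WORK_RESCHEDULED;
	}

	if (!m->off_state && !m->req_count)
		m->off_state = true;

	model_unlock(m);
	return WORK_DONE;
}
```

If the handler runs while another context holds the lock, it returns
WORK_RESCHEDULED without touching off_state; once the lock is free, a later
run enables GFXOFF. Because the handler never blocks on the lock, a caller
that holds it can wait for the handler to finish without deadlocking.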


> The other problem is amdgpu_get_gfx_off_status() also uses the same mutex.

Not sure what for TBH. AFAICT there's only one implementation of this for Renoir, which just reads a register. (It's only called from debugfs)

> So it won't be knowing which thread it would be contending against and blindly creates more work items. 

There is only ever at most one instance of the delayed work at any time. amdgpu_device_delay_enable_gfx_off doesn't care whether amdgpu_gfx_off_ctrl or amdgpu_get_gfx_off_status is holding the mutex, it just keeps re-scheduling itself 100 ms later until it succeeds.
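
The "at most one instance" behaviour, and how it differs from the
mod_delayed_work approach of v1, can be modelled in a few lines of userspace
C. This is a sketch only: struct delayed_work_model and the two model_
functions are hypothetical, time is a plain millisecond counter, and the
return values mirror my understanding of the kernel helpers (schedule_ returns
false if the work was already pending, mod_ returns true if a pending timer
was modified).

```c
#include <stdbool.h>

/* Hypothetical model of a single delayed work item. */
struct delayed_work_model {
	bool pending;           /* queued, timer not yet expired */
	unsigned long expires;  /* modelled expiry time in ms */
};

/*
 * Models schedule_delayed_work(): a no-op if the work is already pending,
 * so the original expiry time is kept.
 */
bool model_schedule_delayed_work(struct delayed_work_model *w,
				 unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false;

	w->pending = true;
	w->expires = now + delay;
	return true;
}

/*
 * Models mod_delayed_work(): always (re)arms the timer, pushing back the
 * expiry time if the work was already pending.
 */
bool model_mod_delayed_work(struct delayed_work_model *w,
			    unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}
```

Queuing at t=0 with a 100 ms delay and calling the schedule_ model again at
t=50 leaves the expiry at 100, which is the behaviour the commit log
describes; the mod_ model at t=50 moves it to 150. In both cases there is
still only one pending item.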


>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>           adev->gfx.gfx_off_req_count--;
>>         if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>> +    } else if (!enable) {
>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> 
> This has the deadlock problem as discussed in the other thread.

It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex, 
mutex_trylock fails and the former bails.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 13:34     ` Michel Dänzer
@ 2021-08-13 14:14       ` Lazar, Lijo
  2021-08-13 14:40         ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-13 14:14 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/13/2021 7:04 PM, Michel Dänzer wrote:
> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>
>>
>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>
>>> schedule_delayed_work does not push back the work if it was already
>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>> was disabled and re-enabled again during those 100 ms.
>>>
>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>> disabling it again (for getting the GPU clock counter).
>>>
>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>> as intended in the reverse case.
>>>
>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>> to use mutex_trylock instead of mutex_lock.
>>>
>>> v2:
>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>     mod_delayed_work.
>>>
>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>    3 files changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f3fd5ec710b6..8b025f70706c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>        struct amdgpu_device *adev =
>>>            container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>> +         */
>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>> +        return;
>>
>> This is not needed and is just creating another thread to contend for mutex.
> 
> Still not sure what you mean by that. What other thread?

Sorry, I meant it schedules another work item and delays GFXOFF 
enablement further. For example, when another function such as 
gfx_off_status is holding the lock at the time of the check.

> 
>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
> 
> mutex_trylock is needed to prevent the deadlock discussed before and below.
> 
> schedule_delayed_work is needed due to this scenario hinted at by the comment:
> 
> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
> 
> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
> 
> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
> 

I think we need to explain based on the original code before. There is 
an assumption here that the only other contention for this mutex is with 
the gfx_off_ctrl function. That is not true, so this is not the only 
case where mutex_trylock can fail. It could be because gfx_off_status is 
holding the lock.

As far as I understand if the work has already started running when 
schedule_delayed_work is called, it will insert another in the work 
queue after delay. Based on that understanding I didn't find a problem 
with the original code. Maybe, mutex_trylock is added to call _sync to 
make sure work is cancelled or not running but that breaks other 
assumptions.

>> The other problem is amdgpu_get_gfx_off_status() also uses the same mutex.
> 
> Not sure what for TBH. AFAICT there's only one implementation of this for Renoir, which just reads a register. (It's only called from debugfs)
> 

I'm not sure either :) But as long as there are other functions that 
contend for the same lock, it's not good to implement based on 
assumptions only about a particular scenario.

>> So it won't be knowing which thread it would be contending against and blindly creates more work items.
> 
> There is only ever at most one instance of the delayed work at any time. amdgpu_device_delay_enable_gfx_off doesn't care whether amdgpu_gfx_off_ctrl or amdgpu_get_gfx_off_status is holding the mutex, it just keeps re-scheduling itself 100 ms later until it succeeds.
> 

Yes, that is the problem: there could be cases where it could have gone
to gfxoff right after gfx_off_status releases the lock, but it doesn't,
delaying it further. The same would apply if some other function that
takes this mutex is introduced later.

> 
>>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>            adev->gfx.gfx_off_req_count--;
>>>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>> +    } else if (!enable) {
>>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>
>> This has the deadlock problem as discussed in the other thread.
> 
> It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex,
> mutex_trylock fails and the former bails.
> 

Ok, but now it creates a case of re-arming the work item from the work.

TBH I didn't understand why _sync has to be used instead of plain
cancel_delayed_work().

The edge case you mentioned for cancel_delayed_work looks like a rare
case:

amdgpu_gfx_off_ctrl(disable) gets the lock
amdgpu_device_delay_enable_gfx_off - waits for the lock
amdgpu_gfx_off_ctrl(enable) gets the lock again (this has to be the
matching call for the previous disable)

This scenario looks highly improbable, as in general we expect some other
work to be done between disable/enable.

Thanks,
Lijo



* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 14:14       ` Lazar, Lijo
@ 2021-08-13 14:40         ` Michel Dänzer
  2021-08-13 15:07           ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-13 14:40 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>
>>>> schedule_delayed_work does not push back the work if it was already
>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>> was disabled and re-enabled again during those 100 ms.
>>>>
>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>> disabling it again (for getting the GPU clock counter).
>>>>
>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>> as intended in the reverse case.
>>>>
>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>> to use mutex_trylock instead of mutex_lock.
>>>>
>>>> v2:
>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>     mod_delayed_work.
>>>>
>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>    3 files changed, 20 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>        struct amdgpu_device *adev =
>>>>            container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>> +         */
>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>> +        return;
>>>
>>> This is not needed and is just creating another thread to contend for mutex.
>>
>> Still not sure what you mean by that. What other thread?
> 
> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
> 
>>
>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>
>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>
>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>
>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>
>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>
>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>
> 
> I think we need to explain based on the original code before. There is an assumption here that the only other contention of this mutex is with the gfx_off_ctrl function.

Not really.


> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.

Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
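
To illustrate the difference, here is a minimal user-space sketch in C. It is not kernel code: the toy_* names and struct toy_delayed_work are made up for this example and only model the pending flag and the expiry time. It assumes that schedule_delayed_work leaves an already pending item untouched, while mod_delayed_work re-arms it. The 100 ms delay mirrors the value mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DELAY_MS 100

/* Toy stand-in for a delayed work item: only pending state and expiry. */
struct toy_delayed_work {
	bool pending;
	long expires_ms;
};

/* Models schedule_delayed_work(): does nothing if the item is already pending. */
static bool toy_schedule(struct toy_delayed_work *w, long now_ms)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires_ms = now_ms + TOY_DELAY_MS;
	return true;
}

/* Models mod_delayed_work(): always re-arms; returns whether it was pending. */
static bool toy_mod(struct toy_delayed_work *w, long now_ms)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires_ms = now_ms + TOY_DELAY_MS;
	return was_pending;
}

/*
 * GFXOFF is re-enabled at t=0 and, after a short disable, again at t=50.
 * Returns the time at which the delayed work would fire.
 */
static long toy_fire_time(bool use_mod)
{
	struct toy_delayed_work w = { .pending = false, .expires_ms = 0 };

	if (use_mod) {
		toy_mod(&w, 0);
		toy_mod(&w, 50);
	} else {
		toy_schedule(&w, 0);
		toy_schedule(&w, 50);
	}
	assert(w.pending);
	return w.expires_ms;
}
```

With the schedule-style helper the work still fires 100 ms after the first enable; with the mod-style helper it fires 100 ms after the last one.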


> [...], there could be cases where it could have gone to gfxoff right after gfx_off_status releases the lock, but it doesn't, delaying it further. That would be the case if some other function is also introduced which takes this mutex.

I really don't think we need to worry about amdgpu_get_gfx_off_status, since it's only called from debugfs (and should be very short). If something hits that debugfs file and it causes higher energy consumption, that's a "doctor, it hurts if I do this" kind of problem.

We can worry about future users of the mutex when they show up.


>>>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>            adev->gfx.gfx_off_req_count--;
>>>>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>> +    } else if (!enable) {
>>>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>
>>> This has the deadlock problem as discussed in the other thread.
>>
>> It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex,
>> mutex_trylock fails and the former bails.
> 
> Ok, but now it creates a case of re-arming the work item from the work.
> 
> TBH I didn't understand the problem on having to use _sync itself and not cancel_delayed_work().
> 
> The edge case you mentioned for a cancel_delayed_work looks like a rare case
> 
> amdgpu_gfx_off_ctrl(disable) gets the lock
> amdgpu_device_delay_enable_gfx_off - waits for the lock
> amdgpu_gfx_off_ctrl(enable) gets the lock again  (this has to be matching call for the previous disable)
> 
> This scenario looks highly improbable as in general we expect some other work that needs to be done between disable/enable.

At least for the case that started me on this journey (reading the GFX clock counter), that should be very short, just a couple of register reads.

I agree it's highly improbable, I'm trying to make it impossible. :)


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 14:40         ` Michel Dänzer
@ 2021-08-13 15:07           ` Lazar, Lijo
  2021-08-13 16:00             ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-13 15:07 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/13/2021 8:10 PM, Michel Dänzer wrote:
> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>
>>>>> schedule_delayed_work does not push back the work if it was already
>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>
>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>> disabling it again (for getting the GPU clock counter).
>>>>>
>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>> as intended in the reverse case.
>>>>>
>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>
>>>>> v2:
>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>      mod_delayed_work.
>>>>>
>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>>     3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>         struct amdgpu_device *adev =
>>>>>             container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>     -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>>> +         */
>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>> +        return;
>>>>
>>>> This is not needed and is just creating another thread to contend for mutex.
>>>
>>> Still not sure what you mean by that. What other thread?
>>
>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
>>
>>>
>>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>>
>>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>>
>>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>>
>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>>
>>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>
>>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>>
>>
>> I think we need to explain based on the original code before. There is an assumption here that the only other contention of this mutex is with the gfx_off_ctrl function.
> 
> Not really.
> 
> 
>> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.
> 
> Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
> 

I meant the original implementation of 
amdgpu_device_delay_enable_gfx_off().


If you indeed want to use _sync, there is a small problem with this
implementation as well, which is roughly equivalent to the original
problem you faced.

amdgpu_gfx_off_ctrl(disable) locks mutex
calls cancel_delayed_work_sync
amdgpu_device_delay_enable_gfx_off already started running
	mutex_trylock fails and schedules another one
amdgpu_gfx_off_ctrl(enable)
	schedule_delayed_work() - Delay is not extended, it's the same as when
	it's rearmed from work item.

Probably overthinking the solution. Looking back, the mod_ version is
simpler :). Maybe just delay it further every time there is a call with
enable instead of doing it only for req_cnt==0?

Thanks,
Lijo

> 
>> [...], there could be cases where it could have gone to gfxoff right after gfx_off_status releases the lock, but it doesn't, delaying it further. That would be the case if some other function is also introduced which takes this mutex.
> 
> I really don't think we need to worry about amdgpu_get_gfx_off_status, since it's only called from debugfs (and should be very short). If something hits that debugfs file and it causes higher energy consumption, that's a "doctor, it hurts if I do this" kind of problem.
> 
> We can worry about future users of the mutex when they show up.
> 
> 
>>>>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>             adev->gfx.gfx_off_req_count--;
>>>>>           if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>> +    } else if (!enable) {
>>>>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>
>>>> This has the deadlock problem as discussed in the other thread.
>>>
>>> It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex,
>>> mutex_trylock fails and the former bails.
>>
>> Ok, but now it creates a case of re-arming the work item from the work.
>>
>> TBH I didn't understand the problem on having to use _sync itself and not cancel_delayed_work().
>>
>> The edge case you mentioned for a cancel_delayed_work looks like a rare case
>>
>> amdgpu_gfx_off_ctrl(disable) gets the lock
>> amdgpu_device_delay_enable_gfx_off - waits for the lock
>> amdgpu_gfx_off_ctrl(enable) gets the lock again  (this has to be matching call for the previous disable)
>>
>> This scenario looks highly improbable as in general we expect some other work that needs to be done between disable/enable.
> 
> At least for the case that started me on this journey (reading the GFX clock counter), that should be very short, just a couple of register reads.
> 
> I agree it's highly improbable, I'm trying to make it impossible. :)
> 
> 


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 15:07           ` Lazar, Lijo
@ 2021-08-13 16:00             ` Michel Dänzer
  2021-08-16  4:13               ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-13 16:00 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-13 5:07 p.m., Lazar, Lijo wrote:
> 
> 
> On 8/13/2021 8:10 PM, Michel Dänzer wrote:
>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>
>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>
>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>
>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>>> as intended in the reverse case.
>>>>>>
>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>>
>>>>>> v2:
>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>      mod_delayed_work.
>>>>>>
>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>>>     3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>         struct amdgpu_device *adev =
>>>>>>             container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>     -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>>>> +         */
>>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>>> +        return;
>>>>>
>>>>> This is not needed and is just creating another thread to contend for mutex.
>>>>
>>>> Still not sure what you mean by that. What other thread?
>>>
>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
>>>
>>>>
>>>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>>>
>>>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>>>
>>>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>>>
>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>>>
>>>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>>
>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>>>
>>>
>>> I think we need to explain based on the original code before. There is an assumption here that the only other contention of this mutex is with the gfx_off_ctrl function.
>>
>> Not really.
>>
>>
>>> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.
>>
>> Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
>>
> 
> I meant the original implementation of amdgpu_device_delay_enable_gfx_off().
> 
> 
> If you indeed want to use _sync, there is a small problem with this implementation also which is roughly equivalent to the original problem you faced.
> 
> amdgpu_gfx_off_ctrl(disable) locks mutex
> calls cancel_delayed_work_sync
> amdgpu_device_delay_enable_gfx_off already started running
>     mutex_trylock fails and schedules another one
> amdgpu_gfx_off_ctrl(enable)
>     schedule_delayed_work() - Delay is not extended, it's the same as when it's rearmed from work item.


This cannot happen. When cancel_delayed_work_sync returns, it guarantees that the delayed work is not scheduled, even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. In other words, it cancels that as well.


> Probably overthinking the solution. Looking back, the mod_ version is simpler :). Maybe just delay it further every time there is a call with enable instead of doing it only for req_cnt==0?

That has some issues as well:

* Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the former starts running between when the latter locks the mutex and calls mod_delayed_work.
* If the work is not scheduled yet, mod_delayed_work would schedule it, even if req_count > 0, in which case it couldn't actually enable GFXOFF.

Conceptually, making sure the work is never scheduled while req_count > 0 seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring functions (which are presumably hotter paths than these amdgpu_gfx_off functions) I needlessly modified in patch 2.

(It also means amdgpu_device_delay_enable_gfx_off technically no longer needs to test req_count or gfx_off_state; I can spin a v3 for that if desired)
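
The "never scheduled while req_count > 0" principle can be sketched as a small single-threaded user-space model in C. This is only an illustration under stated assumptions, not the driver code: struct toy_gfx and the toy_* functions are hypothetical, the counter handling at the top of toy_gfx_off_ctrl and the unconditional clearing of gfx_off_state on disable are assumed (the quoted hunk does not show them), and clearing work_pending stands in for cancel_delayed_work_sync.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DELAY_MS 100

struct toy_gfx {
	unsigned int req_count;  /* outstanding requests to keep GFXOFF disabled */
	bool gfx_off_state;      /* true while GFXOFF is enabled in the modelled HW */
	bool work_pending;       /* models the delayed work being queued */
	long expires_ms;
};

/* Models the delayed work firing once its timer has expired. */
static void toy_run_work(struct toy_gfx *g, long now_ms)
{
	if (!g->work_pending || now_ms < g->expires_ms)
		return;
	g->work_pending = false;
	if (!g->gfx_off_state && g->req_count == 0)
		g->gfx_off_state = true;
}

/* Models amdgpu_gfx_off_ctrl() roughly as changed by the quoted hunk. */
static void toy_gfx_off_ctrl(struct toy_gfx *g, bool enable, long now_ms)
{
	if (!enable)
		g->req_count++;
	else if (g->req_count > 0)
		g->req_count--;

	if (enable && !g->gfx_off_state && g->req_count == 0) {
		if (!g->work_pending) {           /* schedule_delayed_work semantics */
			g->work_pending = true;
			g->expires_ms = now_ms + TOY_DELAY_MS;
		}
	} else if (!enable) {
		if (g->req_count == 1 && !g->gfx_off_state)
			g->work_pending = false;  /* stands in for cancel_delayed_work_sync */
		g->gfx_off_state = false;         /* assumed: GFXOFF is disabled in HW */
	}
}

/*
 * Enable at t=0, disable at t=50, enable at t=60. Returns the time at which
 * the model re-enables GFXOFF, or -1 if that never happens. Asserts that the
 * work is never pending while req_count > 0.
 */
static long toy_reenable_time(void)
{
	struct toy_gfx g = { .req_count = 1 };

	for (long t = 0; t <= 300; t += 10) {
		if (t == 0 || t == 60)
			toy_gfx_off_ctrl(&g, true, t);
		else if (t == 50)
			toy_gfx_off_ctrl(&g, false, t);
		toy_run_work(&g, t);
		assert(!(g.work_pending && g.req_count > 0));
		if (g.gfx_off_state)
			return t;
	}
	return -1;
}
```

In this model the work queued at t=0 is cancelled by the disable at t=50, so GFXOFF is only re-enabled 100 ms after the last enable (t=160).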


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 16:00             ` Michel Dänzer
@ 2021-08-16  4:13               ` Lazar, Lijo
  2021-08-16 10:45                 ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-16  4:13 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/13/2021 9:30 PM, Michel Dänzer wrote:
> On 2021-08-13 5:07 p.m., Lazar, Lijo wrote:
>>
>>
>> On 8/13/2021 8:10 PM, Michel Dänzer wrote:
>>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>
>>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>>
>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>>
>>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>>>> as intended in the reverse case.
>>>>>>>
>>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>>>
>>>>>>> v2:
>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>>       mod_delayed_work.
>>>>>>>
>>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>>>>      3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>>          struct amdgpu_device *adev =
>>>>>>>              container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>>      -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>>>>> +         */
>>>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>>>> +        return;
>>>>>>
>>>>>> This is not needed and is just creating another thread to contend for mutex.
>>>>>
>>>>> Still not sure what you mean by that. What other thread?
>>>>
>>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
>>>>
>>>>>
>>>>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>>>>
>>>>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>>>>
>>>>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>>>>
>>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>>>>
>>>>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>>>
>>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>>>>
>>>>
>>>> I think we need to explain based on the original code before. There is an assumption here that the only other contention of this mutex is with the gfx_off_ctrl function.
>>>
>>> Not really.
>>>
>>>
>>>> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.
>>>
>>> Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
>>>
>>
>> I meant the original implementation of amdgpu_device_delay_enable_gfx_off().
>>
>>
>> If you indeed want to use _sync, there is a small problem with this implementation also which is roughly equivalent to the original problem you faced.
>>
>> amdgpu_gfx_off_ctrl(disable) locks mutex
>> calls cancel_delayed_work_sync
>> amdgpu_device_delay_enable_gfx_off already started running
>>      mutex_trylock fails and schedules another one
>> amdgpu_gfx_off_ctrl(enable)
>>      schedule_delayed_work() - Delay is not extended, it's the same as when it's rearmed from work item.
> 
> 
> This cannot happen. When cancel_delayed_work_sync returns, it guarantees that the delayed work is not scheduled, even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. In other words, it cancels that as well.
> 

Ah, thanks! I didn't know that it also cancels re-queued work. In 
that case, maybe reduce the delay for re-queuing it, say to 50% or 25% of 
AMDGPU_GFX_OFF_DELAY_ENABLE. Instead of delaying GFXOFF further, it's 
better to enable it sooner, as it's losing out to another enable or some 
other function.

>> Probably overthinking the solution. Looking back, the mod_ version is simpler :). Maybe just delay it further every time there is a call with enable, instead of doing it only for req_cnt==0?
> 
> That has some issues as well:
> 
> * Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the former starts running between when the latter locks the mutex and calls mod_delayed_work.
> * If the work is not scheduled yet, mod_delayed_work would schedule it, even if req_count > 0, in which case it couldn't actually enable GFXOFF.
> 
> Conceptually, making sure the work is never scheduled while req_count > 0 seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring functions (which are presumably hotter paths than these amdgpu_gfx_off functions) I needlessly modified in patch 2.
> 
> (It also means amdgpu_device_delay_enable_gfx_off technically no longer needs to test req_count or gfx_off_state; I can spin a v3 for that if desired)
> 

I would still keep the gfx_off_state check, to avoid executing the 
sequence when buggy enable calls arrive while it's already in GFXOFF 
(if that happens at all).

Thanks,
Lijo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks
  2021-08-12  8:11         ` Michel Dänzer
  2021-08-12 11:33           ` Lazar, Lijo
@ 2021-08-16  7:33           ` Christian König
  1 sibling, 0 replies; 49+ messages in thread
From: Christian König @ 2021-08-16  7:33 UTC (permalink / raw)
  To: Michel Dänzer, Koenig, Christian, Quan, Evan, Deucher, Alexander
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

Am 12.08.21 um 10:11 schrieb Michel Dänzer:
> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>> Hi James,
>>
>> Evan seems to have understood how this all works together.
>>
>> See while any begin/end use critical section is active the work should not be active.
>>
>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring you need a lock or counter to prevent concurrent work items from being started.
>>
>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
> It merely assumes that the work may already have been scheduled before.
>
> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>
> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.

Yeah, I think that would be much better.
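
To make the point in the quoted paragraph concrete, here is a minimal userspace sketch (not kernel code) of how the two calls treat a work item that is already pending. The struct and the model_* names are hypothetical and exist only for this illustration; times are plain millisecond counters rather than jiffies, and the only behaviour modelled is the "keep the old expiry" versus "push the expiry back" difference discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a delayed work item; not the kernel struct. */
struct model_delayed_work {
	bool pending;          /* timer armed, handler has not run yet */
	unsigned long expires; /* absolute time (ms) at which the handler runs */
};

/* Models schedule_delayed_work(): does nothing if the work is pending. */
static bool model_schedule(struct model_delayed_work *w,
			   unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false; /* the existing expiry is kept */
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always (re)arms the timer relative to now. */
static bool model_mod(struct model_delayed_work *w,
		      unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}

/*
 * Queue the work at t=0 and again at t=60 with a 100 ms delay, then
 * report when the handler would run.
 */
static unsigned long model_expiry_after_requeue(bool use_mod)
{
	struct model_delayed_work w = { .pending = false, .expires = 0 };

	if (use_mod) {
		model_mod(&w, 0, 100);
		model_mod(&w, 60, 100);
	} else {
		model_schedule(&w, 0, 100);
		model_schedule(&w, 60, 100);
	}
	return w.expires;
}
```

In this model the schedule variant still fires at t=100 while the mod variant fires at t=160. When the work is idle at the time of the call, both helpers arm it identically, which is the "at worst a no-op" case mentioned for patch 2.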

>> Something similar applies to the first patch I think,
> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>
> I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 2-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.
>
>
>> so when this makes a difference it is actually a bug.
> There was certainly a bug though, which patch 1 fixes. :)

Agreed, just wanted to note that this is most likely not the right 
solution since Alex was already picking it up.

Going to reply separately on the new patch as well.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
  2021-08-13 11:50   ` Lazar, Lijo
@ 2021-08-16  7:38   ` Christian König
  2021-08-16 10:38     ` Michel Dänzer
  2021-08-16 10:20   ` Quan, Evan
  2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
  3 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2021-08-16  7:38 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

Am 13.08.21 um 12:29 schrieb Michel Dänzer:
> From: Michel Dänzer <mdaenzer@redhat.com>
>
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
>
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
>
> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
> enabled to disabled. This makes sure the delayed work will be scheduled
> as intended in the reverse case.
>
> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
> to use mutex_trylock instead of mutex_lock.
>
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>    mod_delayed_work.

While this may work, it still smells a little bit fishy.

In general you have two common locking orders around work items, either 
lock->work or work->lock. If you mix this as lock->work->lock like here, 
trouble is usually imminent.

I think what we should do instead is double-check whether taking the lock 
inside the work item is necessary, and instead make sure that the work 
is sync-canceled when we don't want it to run. In other words, fully 
switch to the lock->work approach.

But please note that these are just high-level design thoughts; I don't 
really know the details of the gfx_off code at all. It could even be that 
we need two locks, one outside and one inside of the work item.

Regards,
Christian.

>
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>   3 files changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..8b025f70706c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>   
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> +	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
> +	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
> +		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
> +		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
> +		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
> +		 */
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +		return;
> +	}
> +
>   	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>   		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>   			adev->gfx.gfx_off_state = true;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..da4c46db3093 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -28,9 +28,6 @@
>   #include "amdgpu_rlc.h"
>   #include "amdgpu_ras.h"
>   
> -/* delay 0.1 second to enable gfx off feature */
> -#define GFX_OFF_DELAY_ENABLE         msecs_to_jiffies(100)
> -
>   /*
>    * GPU GFX IP block helpers function.
>    */
> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>   		adev->gfx.gfx_off_req_count--;
>   
>   	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +	} else if (!enable) {
> +		if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
> +			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +		if (adev->gfx.gfx_off_state &&
> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>   			adev->gfx.gfx_off_state = false;
>   
>   			if (adev->gfx.funcs->init_spm_golden) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index d43fe2ed8116..dcdb505bb7f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -32,6 +32,9 @@
>   #include "amdgpu_rlc.h"
>   #include "soc15.h"
>   
> +/* delay 0.1 second to enable gfx off feature */
> +#define AMDGPU_GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
> +
>   /* GFX current status */
>   #define AMDGPU_GFX_NORMAL_MODE			0x00000000L
>   #define AMDGPU_GFX_SAFE_MODE			0x00000001L


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
  2021-08-13 11:50   ` Lazar, Lijo
  2021-08-16  7:38   ` Christian König
@ 2021-08-16 10:20   ` Quan, Evan
  2021-08-16 10:43     ` Michel Dänzer
  2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
  3 siblings, 1 reply; 49+ messages in thread
From: Quan, Evan @ 2021-08-16 10:20 UTC (permalink / raw)
  To: Michel Dänzer, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

[AMD Official Use Only]

Hi Michel,

The patch seems reasonable to me (especially the cancel_delayed_work_sync() part).
However, can you explain more about the code below?
What's the race issue here exactly?

+	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
+	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
+		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
+		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
+		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
+		 */
+		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
+		return;
+	}

BR
Evan
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Michel Dänzer
> Sent: Friday, August 13, 2021 6:29 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
> 
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
> enabled to disabled. This makes sure the delayed work will be scheduled
> as intended in the reverse case.
> 
> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
> to use mutex_trylock instead of mutex_lock.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>   mod_delayed_work.
> 
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>  3 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..8b025f70706c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>  	struct amdgpu_device *adev =
>  		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
> 
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> +	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
> +	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
> +		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
> +		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
> +		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
> +		 */
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +		return;
> +	}
> +
>  	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>  		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>  			adev->gfx.gfx_off_state = true;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..da4c46db3093 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -28,9 +28,6 @@
>  #include "amdgpu_rlc.h"
>  #include "amdgpu_ras.h"
> 
> -/* delay 0.1 second to enable gfx off feature */
> -#define GFX_OFF_DELAY_ENABLE         msecs_to_jiffies(100)
> -
>  /*
>   * GPU GFX IP block helpers function.
>   */
> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>  		adev->gfx.gfx_off_req_count--;
> 
>  	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +	} else if (!enable) {
> +		if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
> +			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +		if (adev->gfx.gfx_off_state &&
> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>  			adev->gfx.gfx_off_state = false;
> 
>  			if (adev->gfx.funcs->init_spm_golden) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index d43fe2ed8116..dcdb505bb7f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -32,6 +32,9 @@
>  #include "amdgpu_rlc.h"
>  #include "soc15.h"
> 
> +/* delay 0.1 second to enable gfx off feature */
> +#define AMDGPU_GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
> +
>  /* GFX current status */
>  #define AMDGPU_GFX_NORMAL_MODE			0x00000000L
>  #define AMDGPU_GFX_SAFE_MODE			0x00000001L
> --
> 2.32.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
                     ` (2 preceding siblings ...)
  2021-08-16 10:20   ` Quan, Evan
@ 2021-08-16 10:35   ` Michel Dänzer
  2021-08-16 11:33     ` Lazar, Lijo
                       ` (2 more replies)
  3 siblings, 3 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-16 10:35 UTC (permalink / raw)
  To: Alex Deucher, Christian König; +Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

From: Michel Dänzer <mdaenzer@redhat.com>

schedule_delayed_work does not push back the work if it was already
scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
was disabled and re-enabled again during those 100 ms.

This resulted in frame drops / stutter with the upcoming mutter 41
release on Navi 14, due to constantly enabling GFXOFF in the HW and
disabling it again (for getting the GPU clock counter).

To fix this, call cancel_delayed_work_sync when the disable count
transitions from 0 to 1, and only schedule the delayed work on the
reverse transition, not if the disable count was already 0. This makes
sure the delayed work doesn't run at unexpected times, and allows it to
be lock-free.

v2:
* Use cancel_delayed_work_sync & mutex_trylock instead of
  mod_delayed_work.
v3:
* Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)

Cc: stable@vger.kernel.org
Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f3fd5ec710b6..f944ed858f3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
 	struct amdgpu_device *adev =
 		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
 
-	mutex_lock(&adev->gfx.gfx_off_mutex);
-	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
-			adev->gfx.gfx_off_state = true;
-	}
-	mutex_unlock(&adev->gfx.gfx_off_mutex);
+	WARN_ON_ONCE(adev->gfx.gfx_off_state);
+	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
+
+	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
+		adev->gfx.gfx_off_state = true;
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..ca91aafcb32b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 
 	mutex_lock(&adev->gfx.gfx_off_mutex);
 
-	if (!enable)
-		adev->gfx.gfx_off_req_count++;
-	else if (adev->gfx.gfx_off_req_count > 0)
+	if (enable) {
+		/* If the count is already 0, it means there's an imbalance bug somewhere.
+		 * Note that the bug may be in a different caller than the one which triggers the
+		 * WARN_ON_ONCE.
+		 */
+		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
+			goto unlock;
+
 		adev->gfx.gfx_off_req_count--;
+	} else {
+		adev->gfx.gfx_off_req_count++;
+	}
 
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+	} else if (!enable && adev->gfx.gfx_off_req_count == 1) {
+		cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+		if (adev->gfx.gfx_off_state &&
+		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
 
 			if (adev->gfx.funcs->init_spm_golden) {
@@ -581,6 +592,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 		}
 	}
 
+unlock:
 	mutex_unlock(&adev->gfx.gfx_off_mutex);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16  7:38   ` Christian König
@ 2021-08-16 10:38     ` Michel Dänzer
  0 siblings, 0 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-16 10:38 UTC (permalink / raw)
  To: Christian König, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-16 9:38 a.m., Christian König wrote:
> Am 13.08.21 um 12:29 schrieb Michel Dänzer:
>> From: Michel Dänzer <mdaenzer@redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>> enabled to disabled. This makes sure the delayed work will be scheduled
>> as intended in the reverse case.
>>
>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>> to use mutex_trylock instead of mutex_lock.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
> 
> While this may work it still smells a little bit fishy.
> 
> In general you have two common locking orders around work items, either lock->work or work->lock. If you mix this as lock->work->lock like here, trouble is usually imminent.
> 
> I think what we should do instead is double-check whether taking the lock inside the work item is necessary, and instead make sure that the work is sync-canceled when we don't want it to run. In other words, fully switch to the lock->work approach.

Done in v3, thanks for the suggestion!


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 10:20   ` Quan, Evan
@ 2021-08-16 10:43     ` Michel Dänzer
  0 siblings, 0 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-16 10:43 UTC (permalink / raw)
  To: Quan, Evan, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

On 2021-08-16 12:20 p.m., Quan, Evan wrote:
> [AMD Official Use Only]
> 
> Hi Michel,
> 
> The patch seems reasonable to me (especially the cancel_delayed_work_sync() part).
> However, can you explain more about the code below?
> What's the race issue here exactly?
> 
> +	/* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
> +	if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
> +		/* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
> +		 * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
> +		 * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
> +		 */
> +		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
> +		return;
> +	}

If amdgpu_gfx_off_ctrl was called with enable=true when adev->gfx.gfx_off_req_count == 0 already, it could have prevented amdgpu_device_delay_enable_gfx_off from locking the mutex.

v3 solves this by only scheduling the work when adev->gfx.gfx_off_req_count transitions from 1 to 0, which means it no longer needs to lock the mutex.
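
As a rough illustration of that counting rule, the following userspace sketch models only the request-count transitions. It is not the driver code: struct gfx_off_model and the model_* functions are hypothetical names, the "work" is a simple flag instead of a real delayed work item, and starting with one outstanding disable request is an assumption made for the demo.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the v3 request counting. */
struct gfx_off_model {
	unsigned int req_count;   /* outstanding "disable GFXOFF" requests */
	bool gfx_off_state;       /* GFXOFF enabled in the modelled HW */
	bool work_pending;        /* delayed work armed */
	unsigned int schedules;   /* times the work was scheduled */
	unsigned int cancels;     /* times the work was sync-cancelled */
	unsigned int hw_disables; /* times GFXOFF was switched off in "HW" */
};

static void model_gfx_off_ctrl(struct gfx_off_model *m, bool enable)
{
	if (enable) {
		if (m->req_count == 0)
			return; /* imbalance: v3 warns once and bails out */
		m->req_count--;
	} else {
		m->req_count++;
	}

	if (enable && !m->gfx_off_state && m->req_count == 0) {
		/* 1 -> 0 transition: arm the delayed work */
		m->work_pending = true;
		m->schedules++;
	} else if (!enable && m->req_count == 1) {
		/* 0 -> 1 transition: stands in for cancel_delayed_work_sync */
		m->work_pending = false;
		m->cancels++;
		if (m->gfx_off_state) {
			m->gfx_off_state = false;
			m->hw_disables++;
		}
	}
}

/* Models the lock-free handler: it may only run with count 0 and GFXOFF off. */
static void model_delay_work(struct gfx_off_model *m)
{
	assert(m->work_pending);
	assert(!m->gfx_off_state && m->req_count == 0);
	m->work_pending = false;
	m->gfx_off_state = true;
}

/* n disable/enable pairs arrive before the delay expires, then the work runs. */
static struct gfx_off_model model_burst(unsigned int n)
{
	struct gfx_off_model m = { .req_count = 1 }; /* assumed initial state */
	unsigned int i;

	model_gfx_off_ctrl(&m, true);          /* 1 -> 0: work scheduled */
	for (i = 0; i < n; i++) {
		model_gfx_off_ctrl(&m, false); /* 0 -> 1: pending work cancelled */
		model_gfx_off_ctrl(&m, true);  /* 1 -> 0: work scheduled afresh */
	}
	model_delay_work(&m);                  /* fires after the last enable */
	return m;
}
```

In this model a burst of three disable/enable pairs cancels the work three times and schedules it four times, and the modelled hardware is never toggled until the handler finally runs. Because the handler can only be pending while the count is 0, it has nothing left to check under the mutex.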


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16  4:13               ` Lazar, Lijo
@ 2021-08-16 10:45                 ` Michel Dänzer
  0 siblings, 0 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-16 10:45 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-16 6:13 a.m., Lazar, Lijo wrote:
> On 8/13/2021 9:30 PM, Michel Dänzer wrote:
>> On 2021-08-13 5:07 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 8:10 PM, Michel Dänzer wrote:
>>>> On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
>>>>> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>>>>>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>>>>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>>
>>>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>>>
>>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>>>
>>>>>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>>>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>>>>>> as intended in the reverse case.
>>>>>>>>
>>>>>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>>>>>> to use mutex_trylock instead of mutex_lock.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>>>       mod_delayed_work.
>>>>>>>>
>>>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>>>>>      3 files changed, 20 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>>>          struct amdgpu_device *adev =
>>>>>>>>              container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>>>      -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>>>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>>>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>>>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>>>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>>>>>> +         */
>>>>>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>>>>>> +        return;
>>>>>>>
>>>>>>> This is not needed and is just creating another thread to contend for mutex.
>>>>>>
>>>>>> Still not sure what you mean by that. What other thread?
>>>>>
>>>>> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
>>>>>
>>>>>>
>>>>>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>>>>>
>>>>>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>>>>>
>>>>>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>>>>>
>>>>>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>>>>>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>>>>>
>>>>>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>>>>>
>>>>>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>>>>>
>>>>>
>>>>> I think we need to explain based on the original code before. There is an assumption here that the only other contention for this mutex is with the gfx_off_ctrl function.
>>>>
>>>> Not really.
>>>>
>>>>
>>>>> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.
>>>>
>>>> Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).
>>>>
>>>
>>> I meant the original implementation of amdgpu_device_delay_enable_gfx_off().
>>>
>>>
>>> If you indeed want to use _sync, there is a small problem with this implementation also which is roughly equivalent to the original problem you faced.
>>>
>>> amdgpu_gfx_off_ctrl(disable) locks mutex
>>> calls cancel_delayed_work_sync
>>> amdgpu_device_delay_enable_gfx_off already started running
>>>      mutex_trylock fails and schedules another one
>>> amdgpu_gfx_off_ctrl(enable)
>>>      schedule_delayed_work() - Delay is not extended, it's the same as when it's rearmed from work item.
>>
>>
>> This cannot happen. When cancel_delayed_work_sync returns, it guarantees that the delayed work is not scheduled, even if amdgpu_device_delay_enable_gfx_off called schedule_delayed_work. In other words, it cancels that as well.
>>
> 
> Ah, thanks! I didn't know that it also cancels re-queued work. In that case, maybe reduce the delay for re-queuing it, say to 50% or 25% of AMDGPU_GFX_OFF_DELAY_ENABLE. Instead of delaying GFXOFF further, it's better to enable it sooner, as it's losing out to another enable or some other function.
> 
>>> Probably overthinking the solution. Looking back, the mod_ version is simpler :). Maybe just delay it further every time there is a call with enable, instead of doing it only for req_cnt==0?
>>
>> That has some issues as well:
>>
>> * Still prone to the "amdgpu_device_delay_enable_gfx_off re-enables GFXOFF immediately after amdgpu_gfx_off_ctrl dropped req_count to 0" race if the former starts running between when the latter locks the mutex and calls mod_delayed_work.
>> * If the work is not scheduled yet, mod_delayed_work would schedule it, even if req_count > 0, in which case it couldn't actually enable GFXOFF.
>>
>> Conceptually, making sure the work is never scheduled while req_count > 0 seems cleaner to me. It's the same principle as in the JPEG/UVD/VCE/VCN ring functions (which are presumably hotter paths than these amdgpu_gfx_off functions) I needlessly modified in patch 2.
>>
>> (It also means amdgpu_device_delay_enable_gfx_off technically no longer needs to test req_count or gfx_off_state; I can spin a v3 for that if desired)
>>
> 
> I would still keep the gfx_off_state check to avoid executing the sequence due to buggy enable calls arriving when it's already in GFXOFF (if that happens at all).

The v3 patch addresses all of these issues.
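
For reference, the counting scheme in v3 can be modelled in user space roughly as below. This is only a sketch under stated assumptions, not the driver code: struct gfxoff_model, model_ctrl() and model_tick() are hypothetical names, the delayed work is reduced to a pending flag plus a deadline, and the 100 ms delay is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DELAY_MS 100	/* assumed; mirrors the ~100 ms in the commit message */

/* Hypothetical user-space model; none of these names exist in amdgpu. */
struct gfxoff_model {
	unsigned int req_count;	/* outstanding "disable GFXOFF" requests */
	bool gfx_off;		/* true once GFXOFF is enabled in the "HW" */
	bool work_pending;	/* stands in for the delayed work being queued */
	long deadline_ms;	/* when the pending work would run */
};

/* Models amdgpu_gfx_off_ctrl(): enable == false takes a reference. */
static void model_ctrl(struct gfxoff_model *m, bool enable, long now_ms)
{
	if (enable) {
		if (m->req_count == 0)
			return;	/* imbalance; the patch warns here */
		if (--m->req_count == 0 && !m->gfx_off) {
			/* 1 -> 0 transition: arm the delayed work */
			m->work_pending = true;
			m->deadline_ms = now_ms + MODEL_DELAY_MS;
		}
	} else if (++m->req_count == 1) {
		/* 0 -> 1 transition: cancel the work, leave GFXOFF */
		m->work_pending = false;
		m->gfx_off = false;
	}
}

/* Models the delayed work running once its deadline has passed. */
static void model_tick(struct gfxoff_model *m, long now_ms)
{
	if (m->work_pending && now_ms >= m->deadline_ms) {
		m->work_pending = false;
		m->gfx_off = true;
	}
}
```

Starting from req_count == 1, an enable at t=0, a disable at t=50 and another enable at t=60 leave GFXOFF disabled at t=100 and enable it at t=160 in this model, i.e. the delay counts from the last enable.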


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
@ 2021-08-16 11:33     ` Lazar, Lijo
  2021-08-16 12:06       ` Christian König
  2021-08-17  7:51     ` Quan, Evan
  2021-08-17  8:23     ` [PATCH] " Michel Dänzer
  2 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-16 11:33 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/16/2021 4:05 PM, Michel Dänzer wrote:
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when the disable count
> transitions from 0 to 1, and only schedule the delayed work on the
> reverse transition, not if the disable count was already 0. This makes
> sure the delayed work doesn't run at unexpected times, and allows it to
> be lock-free.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>    mod_delayed_work.
> v3:
> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>   2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..f944ed858f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>   
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> -	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> -			adev->gfx.gfx_off_state = true;
> -	}
> -	mutex_unlock(&adev->gfx.gfx_off_mutex);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_state);

I don't see any case for this. It's not expected to be scheduled in this 
case, right?

> +	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
> +

Thinking about the _ONCE here: this may happen more than once if the work 
is completed as part of the cancel_ call. Is the warning needed?

Anyway,
	Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

> +	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> +		adev->gfx.gfx_off_state = true;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..ca91aafcb32b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>   
>   	mutex_lock(&adev->gfx.gfx_off_mutex);
>   
> -	if (!enable)
> -		adev->gfx.gfx_off_req_count++;
> -	else if (adev->gfx.gfx_off_req_count > 0)
> +	if (enable) {
> +		/* If the count is already 0, it means there's an imbalance bug somewhere.
> +		 * Note that the bug may be in a different caller than the one which triggers the
> +		 * WARN_ON_ONCE.
> +		 */
> +		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> +			goto unlock;
> +
>   		adev->gfx.gfx_off_req_count--;
> +	} else {
> +		adev->gfx.gfx_off_req_count++;
> +	}
>   
>   	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>   		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +	} else if (!enable && adev->gfx.gfx_off_req_count == 1) {
> +		cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +		if (adev->gfx.gfx_off_state &&
> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>   			adev->gfx.gfx_off_state = false;
>   
>   			if (adev->gfx.funcs->init_spm_golden) {
> @@ -581,6 +592,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>   		}
>   	}
>   
> +unlock:
>   	mutex_unlock(&adev->gfx.gfx_off_mutex);
>   }
>   
> 


* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 11:33     ` Lazar, Lijo
@ 2021-08-16 12:06       ` Christian König
  2021-08-16 15:06         ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2021-08-16 12:06 UTC (permalink / raw)
  To: Lazar, Lijo, Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

Am 16.08.21 um 13:33 schrieb Lazar, Lijo:
> On 8/16/2021 4:05 PM, Michel Dänzer wrote:
>> From: Michel Dänzer <mdaenzer@redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when the disable count
>> transitions from 0 to 1, and only schedule the delayed work on the
>> reverse transition, not if the disable count was already 0. This makes
>> sure the delayed work doesn't run at unexpected times, and allows it to
>> be lock-free.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
>> v3:
>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f3fd5ec710b6..f944ed858f3e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>       struct amdgpu_device *adev =
>>           container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>   
>> -    mutex_lock(&adev->gfx.gfx_off_mutex);
>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> -            adev->gfx.gfx_off_state = true;
>> -    }
>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>
> Don't see any case for this. It's not expected to be scheduled in this 
> case, right?
>
>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>> +
>
> Thinking about ON_ONCE here - this may happen more than once if it's 
> completed as part of cancel_ call. Is the warning needed?

WARN_ON_ONCE() is usually used to prevent spamming the system log with 
warnings. I.e. the warning is printed only once, indicating a driver bug, 
and that's it.

>
> Anyway,
>     Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

Regards,
Christian.

>
>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> +        adev->gfx.gfx_off_state = true;
>>   }
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index a0be0772c8b3..ca91aafcb32b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>   
>>       mutex_lock(&adev->gfx.gfx_off_mutex);
>>   
>> -    if (!enable)
>> -        adev->gfx.gfx_off_req_count++;
>> -    else if (adev->gfx.gfx_off_req_count > 0)
>> +    if (enable) {
>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>> +         * Note that the bug may be in a different caller than the one which triggers the
>> +         * WARN_ON_ONCE.
>> +         */
>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>> +            goto unlock;
>> +
>>           adev->gfx.gfx_off_req_count--;
>> +    } else {
>> +        adev->gfx.gfx_off_req_count++;
>> +    }
>>   
>>       if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>           schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>> +    } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
>> +        cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>> +
>> +        if (adev->gfx.gfx_off_state &&
>> +            !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>               adev->gfx.gfx_off_state = false;
>>   
>>               if (adev->gfx.funcs->init_spm_golden) {
>> @@ -581,6 +592,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>           }
>>       }
>>   
>> +unlock:
>>       mutex_unlock(&adev->gfx.gfx_off_mutex);
>>   }
>>



* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 12:06       ` Christian König
@ 2021-08-16 15:06         ` Michel Dänzer
  2021-08-16 19:02           ` Alex Deucher
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-16 15:06 UTC (permalink / raw)
  To: Christian König, Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-16 2:06 p.m., Christian König wrote:
> Am 16.08.21 um 13:33 schrieb Lazar, Lijo:
>> On 8/16/2021 4:05 PM, Michel Dänzer wrote:
>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>
>>> schedule_delayed_work does not push back the work if it was already
>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>> was disabled and re-enabled again during those 100 ms.
>>>
>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>> disabling it again (for getting the GPU clock counter).
>>>
>>> To fix this, call cancel_delayed_work_sync when the disable count
>>> transitions from 0 to 1, and only schedule the delayed work on the
>>> reverse transition, not if the disable count was already 0. This makes
>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>> be lock-free.
>>>
>>> v2:
>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>    mod_delayed_work.
>>> v3:
>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f3fd5ec710b6..f944ed858f3e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>       struct amdgpu_device *adev =
>>>           container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>   
>>> -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>> -            adev->gfx.gfx_off_state = true;
>>> -    }
>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>
>> Don't see any case for this. It's not expected to be scheduled in this case, right?
>>
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>> +
>>
>> Thinking about ON_ONCE here - this may happen more than once if it's completed as part of cancel_ call. Is the warning needed?
> 
> WARN_ON_ONCE() is usually used to prevent spamming the system log with warnings. E.g. the warning is only printed once indicating a driver bug and that's it.

Right, these WARN_ONs are like assert()s in user-space code, documenting the pre-conditions and checking them at runtime. And I use _ONCE so that if a pre-condition is ever violated for some reason, dmesg isn't spammed with multiple warnings.
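
To illustrate the "assert that only complains once" idea, here is a minimal user-space sketch. It is not the kernel's implementation: MY_WARN_ON_ONCE, warn_count and violate_precondition() are hypothetical names made up for this example, and the macro relies on GNU statement expressions, so it assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts printed warnings; exists only so the behaviour can be observed. */
static int warn_count;

/*
 * Hypothetical stand-in for the kernel's WARN_ON_ONCE(): the condition is
 * evaluated on every call and its truth value is returned, but each call
 * site prints at most one warning (tracked by a static flag).
 */
#define MY_WARN_ON_ONCE(cond) ({					\
	static bool warned_;						\
	bool ret_ = !!(cond);						\
	if (ret_ && !warned_) {						\
		warned_ = true;						\
		warn_count++;						\
		fprintf(stderr, "WARNING: %s at %s:%d\n",		\
			#cond, __FILE__, __LINE__);			\
	}								\
	ret_;								\
})

/* Violates the same "pre-condition" n times; returns warnings printed. */
static int violate_precondition(int n)
{
	int before = warn_count;

	for (int i = 0; i < n; i++)
		(void)MY_WARN_ON_ONCE(i >= 0);	/* always true here */

	return warn_count - before;
}
```

In this sketch violate_precondition(5) prints a single warning, and a later call from the same site prints none, which is the log-spam protection described above.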


>> Anyway,
>>     Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
> 
> Acked-by: Christian König <christian.koenig@amd.com>

Thanks guys!


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 15:06         ` Michel Dänzer
@ 2021-08-16 19:02           ` Alex Deucher
  0 siblings, 0 replies; 49+ messages in thread
From: Alex Deucher @ 2021-08-16 19:02 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Christian König, Lazar, Lijo, Alex Deucher,
	Christian König, Leo Liu, James Zhu, amd-gfx list,
	Mailing list - DRI developers

Applied.  Thanks!

Alex


* RE: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
  2021-08-16 11:33     ` Lazar, Lijo
@ 2021-08-17  7:51     ` Quan, Evan
  2021-08-17  8:17       ` Lazar, Lijo
  2021-08-17  8:23     ` [PATCH] " Michel Dänzer
  2 siblings, 1 reply; 49+ messages in thread
From: Quan, Evan @ 2021-08-17  7:51 UTC (permalink / raw)
  To: Michel Dänzer, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

[AMD Official Use Only]



> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Michel Dänzer
> Sent: Monday, August 16, 2021 6:35 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-
> gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is
> disabled
> 
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when the disable count
> transitions from 0 to 1, and only schedule the delayed work on the
> reverse transition, not if the disable count was already 0. This makes
> sure the delayed work doesn't run at unexpected times, and allows it to
> be lock-free.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>   mod_delayed_work.
> v3:
> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..f944ed858f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>  	struct amdgpu_device *adev =
>  		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
> 
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> -	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> -			adev->gfx.gfx_off_state = true;
> -	}
> -	mutex_unlock(&adev->gfx.gfx_off_mutex);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_state);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
> +
> +	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> +		adev->gfx.gfx_off_state = true;
>  }
> 
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..ca91aafcb32b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
> 
>  	mutex_lock(&adev->gfx.gfx_off_mutex);
> 
> -	if (!enable)
> -		adev->gfx.gfx_off_req_count++;
> -	else if (adev->gfx.gfx_off_req_count > 0)
> +	if (enable) {
> +		/* If the count is already 0, it means there's an imbalance bug somewhere.
> +		 * Note that the bug may be in a different caller than the one which triggers the
> +		 * WARN_ON_ONCE.
> +		 */
> +		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> +			goto unlock;
> +
>  		adev->gfx.gfx_off_req_count--;
> +	} else {
> +		adev->gfx.gfx_off_req_count++;
> +	}
> 
>  	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>  		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +	} else if (!enable && adev->gfx.gfx_off_req_count == 1) {
[Quan, Evan] It seems this leaves a small time window for a race condition. If amdgpu_device_delay_enable_gfx_off() happens to run here, it will trigger "WARN_ON_ONCE(adev->gfx.gfx_off_req_count);". How about something like the below?
@@ -573,13 +573,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
                        goto unlock;

                adev->gfx.gfx_off_req_count--;
-       } else {
-               adev->gfx.gfx_off_req_count++;
        }

        if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
                schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-       } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
+       } else if (!enable && adev->gfx.gfx_off_req_count == 0) {
                cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);

                if (adev->gfx.gfx_off_state &&
@@ -593,6 +591,9 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
                }
        }

+       if (!enable)
+               adev->gfx.gfx_off_req_count++;
+
 unlock:

BR
Evan
> +		cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +		if (adev->gfx.gfx_off_state &&
> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>  			adev->gfx.gfx_off_state = false;
> 
>  			if (adev->gfx.funcs->init_spm_golden) {
> @@ -581,6 +592,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>  		}
>  	}
> 
> +unlock:
>  	mutex_unlock(&adev->gfx.gfx_off_mutex);
>  }
> 
> --
> 2.32.0


* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  7:51     ` Quan, Evan
@ 2021-08-17  8:17       ` Lazar, Lijo
  2021-08-17  8:35         ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17  8:17 UTC (permalink / raw)
  To: Quan, Evan, Michel Dänzer, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel



On 8/17/2021 1:21 PM, Quan, Evan wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Michel Dänzer
>> Sent: Monday, August 16, 2021 6:35 PM
>> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-
>> gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
>> Subject: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is
>> disabled
>>
>> From: Michel Dänzer <mdaenzer@redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when the disable count
>> transitions from 0 to 1, and only schedule the delayed work on the
>> reverse transition, not if the disable count was already 0. This makes
>> sure the delayed work doesn't run at unexpected times, and allows it to
>> be lock-free.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
>> v3:
>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f3fd5ec710b6..f944ed858f3e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>   	struct amdgpu_device *adev =
>>   		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>
>> -	mutex_lock(&adev->gfx.gfx_off_mutex);
>> -	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> -			adev->gfx.gfx_off_state = true;
>> -	}
>> -	mutex_unlock(&adev->gfx.gfx_off_mutex);
>> +	WARN_ON_ONCE(adev->gfx.gfx_off_state);
>> +	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>> +
>> +	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> +		adev->gfx.gfx_off_state = true;
>>   }
>>
>>   /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index a0be0772c8b3..ca91aafcb32b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>
>>   	mutex_lock(&adev->gfx.gfx_off_mutex);
>>
>> -	if (!enable)
>> -		adev->gfx.gfx_off_req_count++;
>> -	else if (adev->gfx.gfx_off_req_count > 0)
>> +	if (enable) {
>> +		/* If the count is already 0, it means there's an imbalance bug somewhere.
>> +		 * Note that the bug may be in a different caller than the one which triggers the
>> +		 * WARN_ON_ONCE.
>> +		 */
>> +		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>> +			goto unlock;
>> +
>>   		adev->gfx.gfx_off_req_count--;
>> +	} else {
>> +		adev->gfx.gfx_off_req_count++;
>> +	}
>>
>>   	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>   		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -	} else if (!enable && adev->gfx.gfx_off_state) {
>> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>> +	} else if (!enable && adev->gfx.gfx_off_req_count == 1) {
> [Quan, Evan] It seems here will leave a small time window for race condition. If amdgpu_device_delay_enable_gfx_off() happens to occur here, it will "WARN_ON_ONCE(adev->gfx.gfx_off_req_count);". How about something as below?
> @@ -573,13 +573,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>                          goto unlock;
> 
>                  adev->gfx.gfx_off_req_count--;
> -       } else {
> -               adev->gfx.gfx_off_req_count++;
>          }
> 
>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>                  schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -       } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
> +       } else if (!enable && adev->gfx.gfx_off_req_count == 0) {
>                  cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> 
>                  if (adev->gfx.gfx_off_state &&
> @@ -593,6 +591,9 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>                  }
>          }
> 
> +       if (!enable)
> +               adev->gfx.gfx_off_req_count++;
> +
>   unlock:
> 

Hi Evan,

It's not a race per se; it is just an undesirable condition of enabling 
GFXOFF immediately followed by disabling GFXOFF. The purpose of the WARN 
is to inform the user about it.

There are other cases, for example if amdgpu_device_delay_enable_gfx_off() 
has already called amdgpu_dpm_set_powergating_by_smu() at the same place 
you pointed out. In this case the WARN doesn't get printed, but it's not 
an optimal situation either. It probably makes sense to move the WARN_ON 
to the last line of amdgpu_device_delay_enable_gfx_off. Either way, I 
don't think it's a race condition.
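
The two orderings under discussion can be compared with a small single-threaded model. This is a sketch only: struct gfxoff_model, model_work() and model_disable() are hypothetical names, the window is simulated with a flag rather than real threads, and "warnings" simply counts how often the work item would have seen a non-zero request count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; names are illustrative only. */
struct gfxoff_model {
	unsigned int req_count;
	bool gfx_off;
	bool work_pending;
	unsigned int warnings;	/* times the work saw req_count != 0 */
};

/* Models amdgpu_device_delay_enable_gfx_off() as of v3 (no lock taken). */
static void model_work(struct gfxoff_model *m)
{
	if (m->req_count)
		m->warnings++;	/* WARN_ON_ONCE(gfx_off_req_count) would fire */
	m->gfx_off = true;
	m->work_pending = false;
}

/*
 * Models the disable path. work_fires says whether the pending work starts
 * running after the mutex is taken but before the cancel call returns.
 * count_first selects the v3 ordering (increment, then cancel); otherwise
 * the suggested ordering (cancel, then increment) is used.
 */
static void model_disable(struct gfxoff_model *m, bool work_fires,
			  bool count_first)
{
	if (count_first)
		m->req_count++;

	if (m->req_count == (count_first ? 1u : 0u)) {
		if (work_fires && m->work_pending)
			model_work(m);	/* the sync cancel waits for it */
		m->work_pending = false;	/* cancelled either way */
		m->gfx_off = false;	/* GFXOFF disabled again if it was on */
	}

	if (!count_first)
		m->req_count++;
}
```

In this model both orderings end in the same state (count 1, GFXOFF disabled); only the warning count differs, and GFXOFF is still enabled and immediately disabled again in both, which is the non-optimal case mentioned above.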

Thanks,
Lijo


> BR
> Evan
>> +		cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>> +
>> +		if (adev->gfx.gfx_off_state &&
>> +		    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>   			adev->gfx.gfx_off_state = false;
>>
>>   			if (adev->gfx.funcs->init_spm_golden) {
>> @@ -581,6 +592,7 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>> *adev, bool enable)
>>   		}
>>   	}
>>
>> +unlock:
>>   	mutex_unlock(&adev->gfx.gfx_off_mutex);
>>   }
>>
>> --
>> 2.32.0


* [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
  2021-08-16 11:33     ` Lazar, Lijo
  2021-08-17  7:51     ` Quan, Evan
@ 2021-08-17  8:23     ` Michel Dänzer
  2021-08-17  9:12       ` Lazar, Lijo
                         ` (2 more replies)
  2 siblings, 3 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-17  8:23 UTC (permalink / raw)
  To: Alex Deucher, Christian König; +Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

From: Michel Dänzer <mdaenzer@redhat.com>

schedule_delayed_work does not push back the work if it was already
scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
was disabled and re-enabled again during those 100 ms.

This resulted in frame drops / stutter with the upcoming mutter 41
release on Navi 14, due to constantly enabling GFXOFF in the HW and
disabling it again (for getting the GPU clock counter).

To fix this, call cancel_delayed_work_sync when the disable count
transitions from 0 to 1, and only schedule the delayed work on the
reverse transition, not if the disable count was already 0. This makes
sure the delayed work doesn't run at unexpected times, and allows it to
be lock-free.

v2:
* Use cancel_delayed_work_sync & mutex_trylock instead of
  mod_delayed_work.
v3:
* Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
v4:
* Fix race condition between amdgpu_gfx_off_ctrl incrementing
  adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
  checking for it to be 0 (Evan Quan)

Cc: stable@vger.kernel.org
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
Acked-by: Christian König <christian.koenig@amd.com> # v3
Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
---

Alex, probably best to wait a bit longer before picking this up. :)

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f3fd5ec710b6..f944ed858f3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
 	struct amdgpu_device *adev =
 		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
 
-	mutex_lock(&adev->gfx.gfx_off_mutex);
-	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
-			adev->gfx.gfx_off_state = true;
-	}
-	mutex_unlock(&adev->gfx.gfx_off_mutex);
+	WARN_ON_ONCE(adev->gfx.gfx_off_state);
+	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
+
+	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
+		adev->gfx.gfx_off_state = true;
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..b4ced45301be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 
 	mutex_lock(&adev->gfx.gfx_off_mutex);
 
-	if (!enable)
-		adev->gfx.gfx_off_req_count++;
-	else if (adev->gfx.gfx_off_req_count > 0)
+	if (enable) {
+		/* If the count is already 0, it means there's an imbalance bug somewhere.
+		 * Note that the bug may be in a different caller than the one which triggers the
+		 * WARN_ON_ONCE.
+		 */
+		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
+			goto unlock;
+
 		adev->gfx.gfx_off_req_count--;
 
-	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
-		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
-			adev->gfx.gfx_off_state = false;
+		if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
+			schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
+	} else {
+		if (adev->gfx.gfx_off_req_count == 0) {
+			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+			if (adev->gfx.gfx_off_state &&
+			    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+				adev->gfx.gfx_off_state = false;
 
-			if (adev->gfx.funcs->init_spm_golden) {
-				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
-				amdgpu_gfx_init_spm_golden(adev);
+				if (adev->gfx.funcs->init_spm_golden) {
+					dev_dbg(adev->dev,
+						"GFXOFF is disabled, re-init SPM golden settings\n");
+					amdgpu_gfx_init_spm_golden(adev);
+				}
 			}
 		}
+
+		adev->gfx.gfx_off_req_count++;
 	}
 
+unlock:
 	mutex_unlock(&adev->gfx.gfx_off_mutex);
 }
 
-- 
2.32.0
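
The timing change described in the commit message can be sketched in
userspace C. This is only a rough model, not driver code: struct
model_dwork, model_schedule() and model_cancel() are hypothetical stand-ins
for the kernel's delayed work item, schedule_delayed_work() and
cancel_delayed_work_sync(), and the 50 ms / 60 ms timestamps are invented
for illustration. The 100 ms delay is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel delayed work item. */
struct model_dwork {
	bool pending;      /* timer armed, handler not yet run */
	long expires_ms;   /* absolute time at which the handler would run */
};

/* Like schedule_delayed_work(): does nothing if the work is already pending. */
static bool model_schedule(struct model_dwork *w, long now_ms, long delay_ms)
{
	assert(delay_ms >= 0);
	if (w->pending)
		return false;
	w->pending = true;
	w->expires_ms = now_ms + delay_ms;
	return true;
}

/* Like cancel_delayed_work_sync() minus the waiting: drops a pending timer. */
static bool model_cancel(struct model_dwork *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Old behaviour: re-enable at t=0, disable at t=50 (no cancel), re-enable at t=60. */
static long expiry_without_cancel(void)
{
	struct model_dwork w = { false, 0 };

	model_schedule(&w, 0, 100);
	model_schedule(&w, 60, 100);   /* ignored: work is already pending */
	return w.expires_ms;
}

/* Patched behaviour: the disable at t=50 cancels the pending work first. */
static long expiry_with_cancel(void)
{
	struct model_dwork w = { false, 0 };

	model_schedule(&w, 0, 100);
	model_cancel(&w);              /* disable: count goes 0 -> 1 */
	model_schedule(&w, 60, 100);   /* re-enable: count goes 1 -> 0 */
	return w.expires_ms;
}
```

In this model the handler is due 100 ms after the first re-enable when
nothing cancels it, and 100 ms after the last re-enable once the disable
path cancels the pending work.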


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  8:17       ` Lazar, Lijo
@ 2021-08-17  8:35         ` Michel Dänzer
  0 siblings, 0 replies; 49+ messages in thread
From: Michel Dänzer @ 2021-08-17  8:35 UTC (permalink / raw)
  To: Lazar, Lijo, Quan, Evan, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

On 2021-08-17 10:17 a.m., Lazar, Lijo wrote:
> On 8/17/2021 1:21 PM, Quan, Evan wrote:
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>> Michel Dänzer
>>> Sent: Monday, August 16, 2021 6:35 PM
>>> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>
>>> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-
>>> gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
>>> Subject: [PATCH v3] drm/amdgpu: Cancel delayed work when GFXOFF is
>>> disabled
>>>
>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>
>>> schedule_delayed_work does not push back the work if it was already
>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>> was disabled and re-enabled again during those 100 ms.
>>>
>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>> disabling it again (for getting the GPU clock counter).
>>>
>>> To fix this, call cancel_delayed_work_sync when the disable count
>>> transitions from 0 to 1, and only schedule the delayed work on the
>>> reverse transition, not if the disable count was already 0. This makes
>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>> be lock-free.
>>>
>>> v2:
>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>    mod_delayed_work.
>>> v3:
>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++++------
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 22 +++++++++++++++++-----
>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f3fd5ec710b6..f944ed858f3e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2777,12 +2777,11 @@ static void
>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>       struct amdgpu_device *adev =
>>>           container_of(work, struct amdgpu_device,
>>> gfx.gfx_off_delay_work.work);
>>>
>>> -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>> -            adev->gfx.gfx_off_state = true;
>>> -    }
>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>> +
>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>> +        adev->gfx.gfx_off_state = true;
>>>   }
>>>
>>>   /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index a0be0772c8b3..ca91aafcb32b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -563,15 +563,26 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device
>>> *adev, bool enable)
>>>
>>>       mutex_lock(&adev->gfx.gfx_off_mutex);
>>>
>>> -    if (!enable)
>>> -        adev->gfx.gfx_off_req_count++;
>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>> +    if (enable) {
>>> +        /* If the count is already 0, it means there's an imbalance bug
>>> somewhere.
>>> +         * Note that the bug may be in a different caller than the one
>>> which triggers the
>>> +         * WARN_ON_ONCE.
>>> +         */
>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>> +            goto unlock;
>>> +
>>>           adev->gfx.gfx_off_req_count--;
>>> +    } else {
>>> +        adev->gfx.gfx_off_req_count++;
>>> +    }
>>>
>>>       if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>           schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev,
>>> AMD_IP_BLOCK_TYPE_GFX, false)) {
>>> +    } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
>> [Quan, Evan] This seems to leave a small time window for a race condition. If amdgpu_device_delay_enable_gfx_off() happens to run here, it will trigger "WARN_ON_ONCE(adev->gfx.gfx_off_req_count);". How about something like the below?
>> @@ -573,13 +573,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>                          goto unlock;
>>
>>                  adev->gfx.gfx_off_req_count--;
>> -       } else {
>> -               adev->gfx.gfx_off_req_count++;
>>          }
>>
>>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>                  schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -       } else if (!enable && adev->gfx.gfx_off_req_count == 1) {
>> +       } else if (!enable && adev->gfx.gfx_off_req_count == 0) {
>>                  cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>
>>                  if (adev->gfx.gfx_off_state &&
>> @@ -593,6 +591,9 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>                  }
>>          }
>>
>> +       if (!enable)
>> +               adev->gfx.gfx_off_req_count++;
>> +
>>   unlock:
>>
> 
> Hi Evan,
> 
> It's not a race per se; it is just an undesirable condition where Enable GFXOFF is immediately followed by Disable GFXOFF. The purpose of the WARN is to notify the user about it.

What Evan pointed out (good catch, thanks!) is technically a race condition WRT adev->gfx.gfx_off_req_count. In this case, though, it would only have triggered the sanity checks that are in place to catch bugs like it; it wouldn't otherwise have affected the correctness of the code.

Fixed in v4.


> There are other cases. For example, amdgpu_device_delay_enable_gfx_off() may already have called amdgpu_dpm_set_powergating_by_smu() at the same place you pointed out.

That OTOH is indeed not a race condition, just unlucky timing.
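
The difference between the v3 and v4 ordering can be sketched with a small
userspace model. It is only an illustration under stated assumptions:
struct model_gfx and the functions below are hypothetical, there is no real
concurrency, and model_work_runs() stands for the worst case in which the
lock-free handler runs while the caller is inside
cancel_delayed_work_sync().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the driver's data structure. */
struct model_gfx {
	int req_count;          /* stands in for gfx_off_req_count */
	int count_seen_by_work; /* value the handler's sanity check would read */
};

/* Stand-in for the handler running while the caller waits in the cancel. */
static void model_work_runs(struct model_gfx *g)
{
	g->count_seen_by_work = g->req_count;
}

/* v3 ordering: increment the count first, then cancel on the 0 -> 1 step. */
static int disable_v3(struct model_gfx *g)
{
	g->req_count++;
	if (g->req_count == 1)
		model_work_runs(g);   /* handler may still run during the cancel */
	return g->count_seen_by_work;
}

/* v4 ordering: cancel while the count is still 0, increment afterwards. */
static int disable_v4(struct model_gfx *g)
{
	if (g->req_count == 0)
		model_work_runs(g);   /* handler may still run during the cancel */
	g->req_count++;
	return g->count_seen_by_work;
}
```

With the v3 ordering the handler can observe a non-zero count and trip
WARN_ON_ONCE(adev->gfx.gfx_off_req_count); with the v4 ordering it can only
observe 0.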


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  8:23     ` [PATCH] " Michel Dänzer
@ 2021-08-17  9:12       ` Lazar, Lijo
  2021-08-17  9:26         ` Michel Dänzer
  2021-08-17  9:33       ` Quan, Evan
  2021-08-18 21:56       ` Alex Deucher
  2 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17  9:12 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/17/2021 1:53 PM, Michel Dänzer wrote:
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when the disable count
> transitions from 0 to 1, and only schedule the delayed work on the
> reverse transition, not if the disable count was already 0. This makes
> sure the delayed work doesn't run at unexpected times, and allows it to
> be lock-free.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>    mod_delayed_work.
> v3:
> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
> v4:
> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>    adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>    checking for it to be 0 (Evan Quan)
> 
> Cc: stable@vger.kernel.org
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
> Acked-by: Christian König <christian.koenig@amd.com> # v3
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
> 
> Alex, probably best to wait a bit longer before picking this up. :)
> 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>   2 files changed, 30 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..f944ed858f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>   	struct amdgpu_device *adev =
>   		container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>   
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> -	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> -			adev->gfx.gfx_off_state = true;
> -	}
> -	mutex_unlock(&adev->gfx.gfx_off_mutex);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_state);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
> +
> +	if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> +		adev->gfx.gfx_off_state = true;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..b4ced45301be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>   
>   	mutex_lock(&adev->gfx.gfx_off_mutex);
>   
> -	if (!enable)
> -		adev->gfx.gfx_off_req_count++;
> -	else if (adev->gfx.gfx_off_req_count > 0)
> +	if (enable) {
> +		/* If the count is already 0, it means there's an imbalance bug somewhere.
> +		 * Note that the bug may be in a different caller than the one which triggers the
> +		 * WARN_ON_ONCE.
> +		 */
> +		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> +			goto unlock;
> +
>   		adev->gfx.gfx_off_req_count--;
>   
> -	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> -			adev->gfx.gfx_off_state = false;
> +		if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
> +			schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> +	} else {
> +		if (adev->gfx.gfx_off_req_count == 0) {
> +			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +			if (adev->gfx.gfx_off_state &&

More of a question which I didn't check last time - Is this expected to 
be true when the disable call comes in first?

Thanks,
Lijo

> +			    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +				adev->gfx.gfx_off_state = false;
>   
> -			if (adev->gfx.funcs->init_spm_golden) {
> -				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
> -				amdgpu_gfx_init_spm_golden(adev);
> +				if (adev->gfx.funcs->init_spm_golden) {
> +					dev_dbg(adev->dev,
> +						"GFXOFF is disabled, re-init SPM golden settings\n");
> +					amdgpu_gfx_init_spm_golden(adev);
> +				}
>   			}
>   		}
> +
> +		adev->gfx.gfx_off_req_count++;
>   	}
>   
> +unlock:
>   	mutex_unlock(&adev->gfx.gfx_off_mutex);
>   }
>   
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  9:12       ` Lazar, Lijo
@ 2021-08-17  9:26         ` Michel Dänzer
  2021-08-17  9:37           ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-17  9:26 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>> From: Michel Dänzer <mdaenzer@redhat.com>
>>
>> schedule_delayed_work does not push back the work if it was already
>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>> was disabled and re-enabled again during those 100 ms.
>>
>> This resulted in frame drops / stutter with the upcoming mutter 41
>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>> disabling it again (for getting the GPU clock counter).
>>
>> To fix this, call cancel_delayed_work_sync when the disable count
>> transitions from 0 to 1, and only schedule the delayed work on the
>> reverse transition, not if the disable count was already 0. This makes
>> sure the delayed work doesn't run at unexpected times, and allows it to
>> be lock-free.
>>
>> v2:
>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>    mod_delayed_work.
>> v3:
>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>> v4:
>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>    adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>    checking for it to be 0 (Evan Quan)
>>
>> Cc: stable@vger.kernel.org
>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>> ---
>>
>> Alex, probably best to wait a bit longer before picking this up. :)
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>   2 files changed, 30 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f3fd5ec710b6..f944ed858f3e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>       struct amdgpu_device *adev =
>>           container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>   -    mutex_lock(&adev->gfx.gfx_off_mutex);
>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> -            adev->gfx.gfx_off_state = true;
>> -    }
>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>> +
>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>> +        adev->gfx.gfx_off_state = true;
>>   }
>>     /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> index a0be0772c8b3..b4ced45301be 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>         mutex_lock(&adev->gfx.gfx_off_mutex);
>>   -    if (!enable)
>> -        adev->gfx.gfx_off_req_count++;
>> -    else if (adev->gfx.gfx_off_req_count > 0)
>> +    if (enable) {
>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>> +         * Note that the bug may be in a different caller than the one which triggers the
>> +         * WARN_ON_ONCE.
>> +         */
>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>> +            goto unlock;
>> +
>>           adev->gfx.gfx_off_req_count--;
>>   -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>> -            adev->gfx.gfx_off_state = false;
>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>> +    } else {
>> +        if (adev->gfx.gfx_off_req_count == 0) {
>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>> +
>> +            if (adev->gfx.gfx_off_state &&
> 
> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?

My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  8:23     ` [PATCH] " Michel Dänzer
  2021-08-17  9:12       ` Lazar, Lijo
@ 2021-08-17  9:33       ` Quan, Evan
  2021-08-18 21:56       ` Alex Deucher
  2 siblings, 0 replies; 49+ messages in thread
From: Quan, Evan @ 2021-08-17  9:33 UTC (permalink / raw)
  To: Michel Dänzer, Deucher, Alexander, Koenig, Christian
  Cc: Liu, Leo, Zhu, James, amd-gfx, dri-devel

[AMD Official Use Only]

Thanks! This seems fine to me.
Reviewed-by: Evan Quan <evan.quan@amd.com>

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Michel Dänzer
> Sent: Tuesday, August 17, 2021 4:23 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Cc: Liu, Leo <Leo.Liu@amd.com>; Zhu, James <James.Zhu@amd.com>; amd-
> gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is
> disabled
> 
> From: Michel Dänzer <mdaenzer@redhat.com>
> 
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF was
> disabled and re-enabled again during those 100 ms.
> 
> This resulted in frame drops / stutter with the upcoming mutter 41 release
> on Navi 14, due to constantly enabling GFXOFF in the HW and disabling it
> again (for getting the GPU clock counter).
> 
> To fix this, call cancel_delayed_work_sync when the disable count transitions
> from 0 to 1, and only schedule the delayed work on the reverse transition,
> not if the disable count was already 0. This makes sure the delayed work
> doesn't run at unexpected times, and allows it to be lock-free.
> 
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>   mod_delayed_work.
> v3:
> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
> v4:
> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>   adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>   checking for it to be 0 (Evan Quan)
> 
> Cc: stable@vger.kernel.org
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
> Acked-by: Christian König <christian.koenig@amd.com> # v3
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
> 
> Alex, probably best to wait a bit longer before picking this up. :)
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>  2 files changed, 30 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..f944ed858f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,12 +2777,11 @@ static void
> amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>  	struct amdgpu_device *adev =
>  		container_of(work, struct amdgpu_device,
> gfx.gfx_off_delay_work.work);
> 
> -	mutex_lock(&adev->gfx.gfx_off_mutex);
> -	if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev,
> AMD_IP_BLOCK_TYPE_GFX, true))
> -			adev->gfx.gfx_off_state = true;
> -	}
> -	mutex_unlock(&adev->gfx.gfx_off_mutex);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_state);
> +	WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
> +
> +	if (!amdgpu_dpm_set_powergating_by_smu(adev,
> AMD_IP_BLOCK_TYPE_GFX, true))
> +		adev->gfx.gfx_off_state = true;
>  }
> 
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..b4ced45301be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device
> *adev, bool enable)
> 
>  	mutex_lock(&adev->gfx.gfx_off_mutex);
> 
> -	if (!enable)
> -		adev->gfx.gfx_off_req_count++;
> -	else if (adev->gfx.gfx_off_req_count > 0)
> +	if (enable) {
> +		/* If the count is already 0, it means there's an imbalance bug
> somewhere.
> +		 * Note that the bug may be in a different caller than the one
> which triggers the
> +		 * WARN_ON_ONCE.
> +		 */
> +		if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> +			goto unlock;
> +
>  		adev->gfx.gfx_off_req_count--;
> 
> -	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -	} else if (!enable && adev->gfx.gfx_off_state) {
> -		if (!amdgpu_dpm_set_powergating_by_smu(adev,
> AMD_IP_BLOCK_TYPE_GFX, false)) {
> -			adev->gfx.gfx_off_state = false;
> +		if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
> +			schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> +	} else {
> +		if (adev->gfx.gfx_off_req_count == 0) {
> +			cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +			if (adev->gfx.gfx_off_state &&
> +			    !amdgpu_dpm_set_powergating_by_smu(adev,
> AMD_IP_BLOCK_TYPE_GFX, false)) {
> +				adev->gfx.gfx_off_state = false;
> 
> -			if (adev->gfx.funcs->init_spm_golden) {
> -				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
> -				amdgpu_gfx_init_spm_golden(adev);
> +				if (adev->gfx.funcs->init_spm_golden) {
> +					dev_dbg(adev->dev,
> +						"GFXOFF is disabled, re-init SPM golden settings\n");
> +					amdgpu_gfx_init_spm_golden(adev);
> +				}
>  			}
>  		}
> +
> +		adev->gfx.gfx_off_req_count++;
>  	}
> 
> +unlock:
>  	mutex_unlock(&adev->gfx.gfx_off_mutex);
>  }
> 
> --
> 2.32.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  9:26         ` Michel Dänzer
@ 2021-08-17  9:37           ` Lazar, Lijo
  2021-08-17  9:59             ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17  9:37 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/17/2021 2:56 PM, Michel Dänzer wrote:
> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>
>>
>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>
>>> schedule_delayed_work does not push back the work if it was already
>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>> was disabled and re-enabled again during those 100 ms.
>>>
>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>> disabling it again (for getting the GPU clock counter).
>>>
>>> To fix this, call cancel_delayed_work_sync when the disable count
>>> transitions from 0 to 1, and only schedule the delayed work on the
>>> reverse transition, not if the disable count was already 0. This makes
>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>> be lock-free.
>>>
>>> v2:
>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>     mod_delayed_work.
>>> v3:
>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>> v4:
>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>     adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>     checking for it to be 0 (Evan Quan)
>>>
>>> Cc: stable@vger.kernel.org
>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>> ---
>>>
>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>    2 files changed, 30 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f3fd5ec710b6..f944ed858f3e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>        struct amdgpu_device *adev =
>>>            container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>> -            adev->gfx.gfx_off_state = true;
>>> -    }
>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>> +
>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>> +        adev->gfx.gfx_off_state = true;
>>>    }
>>>      /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> index a0be0772c8b3..b4ced45301be 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>          mutex_lock(&adev->gfx.gfx_off_mutex);
>>>    -    if (!enable)
>>> -        adev->gfx.gfx_off_req_count++;
>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>> +    if (enable) {
>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>> +         * WARN_ON_ONCE.
>>> +         */
>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>> +            goto unlock;
>>> +
>>>            adev->gfx.gfx_off_req_count--;
>>>    -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>> -            adev->gfx.gfx_off_state = false;
>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>> +    } else {
>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>> +
>>> +            if (adev->gfx.gfx_off_state &&
>>
>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
> 
> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
> 

To clarify, I mean the case when nothing is scheduled. If enable() is
called when the count is 0, it goes to unlock, so the expectation is that
someone calls disable() first. Let's say disable() is called first; then
the variable will be false, right?

Thanks,
Lijo

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  9:37           ` Lazar, Lijo
@ 2021-08-17  9:59             ` Michel Dänzer
  2021-08-17 10:37               ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-17  9:59 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>
>>>
>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>
>>>> schedule_delayed_work does not push back the work if it was already
>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>> was disabled and re-enabled again during those 100 ms.
>>>>
>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>> disabling it again (for getting the GPU clock counter).
>>>>
>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>> reverse transition, not if the disable count was already 0. This makes
>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>> be lock-free.
>>>>
>>>> v2:
>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>     mod_delayed_work.
>>>> v3:
>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>> v4:
>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>     adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>     checking for it to be 0 (Evan Quan)
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>> ---
>>>>
>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>    2 files changed, 30 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>        struct amdgpu_device *adev =
>>>>            container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>> -            adev->gfx.gfx_off_state = true;
>>>> -    }
>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>> +
>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>> +        adev->gfx.gfx_off_state = true;
>>>>    }
>>>>      /**
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> index a0be0772c8b3..b4ced45301be 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>          mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>    -    if (!enable)
>>>> -        adev->gfx.gfx_off_req_count++;
>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>> +    if (enable) {
>>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>>> +         * WARN_ON_ONCE.
>>>> +         */
>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>> +            goto unlock;
>>>> +
>>>>            adev->gfx.gfx_off_req_count--;
>>>>    -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>> -            adev->gfx.gfx_off_state = false;
>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>> +    } else {
>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>> +
>>>> +            if (adev->gfx.gfx_off_state &&
>>>
>>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
>>
>> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>
> 
> To clarify - when nothing is scheduled. If enable() is called when the count is 0, it goes to unlock. Now the expectation is someone to call Disable first.

Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which

        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))

will catch.
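
For illustration only, the imbalance guard can be sketched in user space. This is a minimal model and not driver code: struct req_counter, req_get and req_put are hypothetical names, and the warned flag merely imitates the one-shot behaviour of WARN_ON_ONCE.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for adev->gfx.gfx_off_req_count. */
struct req_counter {
    unsigned int count;   /* outstanding "disable GFXOFF" requests */
    bool warned;          /* imitates the one-shot nature of WARN_ON_ONCE */
};

/* A disable request always takes a reference. */
static void req_get(struct req_counter *c)
{
    c->count++;
}

/* An enable request drops a reference. If there is nothing to drop,
 * warn once and return false without touching the count. */
static bool req_put(struct req_counter *c)
{
    if (c->count == 0) {
        if (!c->warned) {
            fprintf(stderr, "imbalanced enable request\n");
            c->warned = true;
        }
        return false;
    }
    c->count--;
    return true;
}
```

In this model an enable request issued while the count is 0 is rejected and leaves the count untouched, which corresponds to the "goto unlock" in the patch.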


> Let's say  Disable() is called first, then the variable will be false, right?

Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW.

Note that this is a separate pre-existing bug, not a regression of my patch.

I wonder what's the best solution for that, move the adev->gfx.gfx_off_state assignments into amdgpu_dpm_set_powergating_by_smu?


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  9:59             ` Michel Dänzer
@ 2021-08-17 10:37               ` Lazar, Lijo
  2021-08-17 11:06                 ` Michel Dänzer
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17 10:37 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/17/2021 3:29 PM, Michel Dänzer wrote:
> On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
>>
>>
>> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>>
>>>>
>>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>
>>>>> schedule_delayed_work does not push back the work if it was already
>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>
>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>> disabling it again (for getting the GPU clock counter).
>>>>>
>>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>>> reverse transition, not if the disable count was already 0. This makes
>>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>>> be lock-free.
>>>>>
>>>>> v2:
>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>      mod_delayed_work.
>>>>> v3:
>>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>>> v4:
>>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>>      adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>>      checking for it to be 0 (Evan Quan)
>>>>>
>>>>> Cc: stable@vger.kernel.org
>>>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>> ---
>>>>>
>>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>>
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>>     2 files changed, 30 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>         struct amdgpu_device *adev =
>>>>>             container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>     -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>> -            adev->gfx.gfx_off_state = true;
>>>>> -    }
>>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>>> +
>>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>> +        adev->gfx.gfx_off_state = true;
>>>>>     }
>>>>>       /**
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>> index a0be0772c8b3..b4ced45301be 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>           mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>     -    if (!enable)
>>>>> -        adev->gfx.gfx_off_req_count++;
>>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>>> +    if (enable) {
>>>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>>>> +         * WARN_ON_ONCE.
>>>>> +         */
>>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>> +            goto unlock;
>>>>> +
>>>>>             adev->gfx.gfx_off_req_count--;
>>>>>     -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>> -            adev->gfx.gfx_off_state = false;
>>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>> +    } else {
>>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>> +
>>>>> +            if (adev->gfx.gfx_off_state &&
>>>>
>>>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
>>>
>>> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>>
>>
>> To clarify - when nothing is scheduled. If enable() is called when the count is 0, it goes to unlock. Now the expectation is someone to call Disable first.
> 
> Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which
> 
>          if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> 
> will catch.
> 
> 
>> Let's say  Disable() is called first, then the variable will be false, right?
> 
> Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW.

Exactly.
> 
> Note that this is a separate pre-existing bug, not a regression of my patch.
> 
> I wonder what's the best solution for that, move the adev->gfx.gfx_off_state assignments into amdgpu_dpm_set_powergating_by_smu?

It should be an existing one; I never bothered about that condition before.

One hack would be

	is_pending = cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);

	if ((adev->gfx.gfx_off_state || !is_pending) &&

If the work was never scheduled or is not pending, is_pending should be
false; if it already got executed, gfx_off_state should be set.
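
As a rough user-space sketch of how that condition would behave (not driver code): struct gfx_model, model_cancel_work and model_disable are hypothetical names, hw_gfxoff is an assumed stand-in for what the hardware is actually doing, and the SMU call is assumed to always succeed. The only kernel fact relied on is that cancel_delayed_work_sync returns true if the work was pending.

```c
#include <stdbool.h>

/* Hypothetical model of the GFXOFF bookkeeping; not kernel code. */
struct gfx_model {
    unsigned int req_count; /* models adev->gfx.gfx_off_req_count */
    bool off_state;         /* models adev->gfx.gfx_off_state */
    bool work_pending;      /* delayed work queued but not yet run */
    bool hw_gfxoff;         /* assumed real hardware/firmware state */
};

/* Imitates cancel_delayed_work_sync(): reports whether work was pending. */
static bool model_cancel_work(struct gfx_model *m)
{
    bool was_pending = m->work_pending;

    m->work_pending = false;
    return was_pending;
}

/* Disable path on the 0 -> 1 transition. With use_hack set, the HW is
 * also touched when no work was pending, as in the suggested condition. */
static void model_disable(struct gfx_model *m, bool use_hack)
{
    if (m->req_count == 0) {
        bool is_pending = model_cancel_work(m);

        if (m->off_state || (use_hack && !is_pending)) {
            m->hw_gfxoff = false;   /* assumed to succeed */
            m->off_state = false;
        }
    }
    m->req_count++;
}
```

With off_state false and nothing pending, the plain condition leaves hw_gfxoff alone, while the hacked condition clears it. If work was pending and got cancelled, neither variant touches the hardware.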

Thanks,
Lijo

> 
> 

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17 10:37               ` Lazar, Lijo
@ 2021-08-17 11:06                 ` Michel Dänzer
  2021-08-17 11:49                   ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2021-08-17 11:06 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel

On 2021-08-17 12:37 p.m., Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 3:29 PM, Michel Dänzer wrote:
>> On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
>>>
>>>
>>> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>>>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>>>
>>>>>
>>>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>
>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>
>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>
>>>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>>>> reverse transition, not if the disable count was already 0. This makes
>>>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>>>> be lock-free.
>>>>>>
>>>>>> v2:
>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>      mod_delayed_work.
>>>>>> v3:
>>>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>>>> v4:
>>>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>>>      adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>>>      checking for it to be 0 (Evan Quan)
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>>>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>> ---
>>>>>>
>>>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>>>
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>>>     2 files changed, 30 insertions(+), 17 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>         struct amdgpu_device *adev =
>>>>>>             container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>     -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>> -            adev->gfx.gfx_off_state = true;
>>>>>> -    }
>>>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>>>> +
>>>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>> +        adev->gfx.gfx_off_state = true;
>>>>>>     }
>>>>>>       /**
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> index a0be0772c8b3..b4ced45301be 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>>           mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>     -    if (!enable)
>>>>>> -        adev->gfx.gfx_off_req_count++;
>>>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>>>> +    if (enable) {
>>>>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>>>>> +         * WARN_ON_ONCE.
>>>>>> +         */
>>>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>>> +            goto unlock;
>>>>>> +
>>>>>>             adev->gfx.gfx_off_req_count--;
>>>>>>     -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>>> -            adev->gfx.gfx_off_state = false;
>>>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>> +    } else {
>>>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>>> +
>>>>>> +            if (adev->gfx.gfx_off_state &&
>>>>>
>>>>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
>>>>
>>>> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>>>
>>>
>>> To clarify - when nothing is scheduled. If enable() is called when the count is 0, it goes to unlock. Now the expectation is someone to call Disable first.
>>
>> Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which
>>
>>          if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>
>> will catch.
>>
>>
>>> Let's say  Disable() is called first, then the variable will be false, right?
>>
>> Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW.
> 
> Exactly.

Turns out that's not the end of that rabbit (side-)hole yet. :)

amdgpu_device_init initializes adev->gfx.gfx_off_req_count = 1. amdgpu_gfx_off_ctrl is then called with enable=true from amdgpu_device_init → amdgpu_device_ip_late_init → amdgpu_device_set_pg_state. This schedules amdgpu_device_delay_enable_gfx_off, which runs ~100ms later, enables GFXOFF in the HW and sets adev->gfx.gfx_off_state = true.

So it looks fine as is actually, if a bit convoluted. (I wonder if GFXOFF shouldn't rather be enabled synchronously during initialization though)
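
The sequence can be traced with a small user-space model. This is only a sketch under the assumptions stated in this mail (initial count of 1, work body setting the state flag); struct gfxoff_model and the model_* helpers are hypothetical names, and the ~100 ms delay is represented by calling model_run_work explicitly.

```c
#include <stdbool.h>

/* Hypothetical model of the state discussed above; not kernel code. */
struct gfxoff_model {
    unsigned int req_count; /* stands in for adev->gfx.gfx_off_req_count */
    bool off_state;         /* stands in for adev->gfx.gfx_off_state */
    bool work_pending;      /* delayed work queued but not yet run */
};

/* Initial values as described for amdgpu_device_init in this thread. */
static void model_init(struct gfxoff_model *m)
{
    m->req_count = 1;
    m->off_state = false;
    m->work_pending = false;
}

/* Enable path: drop one request and queue the work on the 1 -> 0
 * transition. Returns false for an imbalanced call (count already 0). */
static bool model_enable(struct gfxoff_model *m)
{
    if (m->req_count == 0)
        return false;

    m->req_count--;
    if (m->req_count == 0 && !m->off_state)
        m->work_pending = true;
    return true;
}

/* Delayed work body: runs later and records GFXOFF as enabled. */
static void model_run_work(struct gfxoff_model *m)
{
    if (m->work_pending) {
        m->work_pending = false;
        m->off_state = true;
    }
}
```

Calling model_init, model_enable and then model_run_work leaves the count at 0 with off_state set, so a later disable request sees off_state == true.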


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17 11:06                 ` Michel Dänzer
@ 2021-08-17 11:49                   ` Lazar, Lijo
  2021-08-17 12:55                     ` Lazar, Lijo
  0 siblings, 1 reply; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17 11:49 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/17/2021 4:36 PM, Michel Dänzer wrote:
> On 2021-08-17 12:37 p.m., Lazar, Lijo wrote:
>>
>>
>> On 8/17/2021 3:29 PM, Michel Dänzer wrote:
>>> On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
>>>>
>>>>
>>>> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>>>>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>>>>
>>>>>>
>>>>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>
>>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>>
>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>>
>>>>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>>>>> reverse transition, not if the disable count was already 0. This makes
>>>>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>>>>> be lock-free.
>>>>>>>
>>>>>>> v2:
>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>>       mod_delayed_work.
>>>>>>> v3:
>>>>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>>>>> v4:
>>>>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>>>>       adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>>>>       checking for it to be 0 (Evan Quan)
>>>>>>>
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>> ---
>>>>>>>
>>>>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>>>>
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>>>>      2 files changed, 30 insertions(+), 17 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>>          struct amdgpu_device *adev =
>>>>>>>              container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>>      -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>> -            adev->gfx.gfx_off_state = true;
>>>>>>> -    }
>>>>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>>>>> +
>>>>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>> +        adev->gfx.gfx_off_state = true;
>>>>>>>      }
>>>>>>>        /**
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> index a0be0772c8b3..b4ced45301be 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>>>            mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>>      -    if (!enable)
>>>>>>> -        adev->gfx.gfx_off_req_count++;
>>>>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>>>>> +    if (enable) {
>>>>>>> +        /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>>>>> +         * Note that the bug may be in a different caller than the one which triggers the
>>>>>>> +         * WARN_ON_ONCE.
>>>>>>> +         */
>>>>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>>>> +            goto unlock;
>>>>>>> +
>>>>>>>              adev->gfx.gfx_off_req_count--;
>>>>>>>      -    if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>>>> -            adev->gfx.gfx_off_state = false;
>>>>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>>> +    } else {
>>>>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>>>> +
>>>>>>> +            if (adev->gfx.gfx_off_state &&
>>>>>>
>>>>>> More of a question which I didn't check last time - Is this expected to be true when the disable call comes in first?
>>>>>
>>>>> My assumption is that cancel_delayed_work_sync guarantees amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>>>>
>>>>
>>>> To clarify - when nothing is scheduled. If enable() is called when the count is 0, it goes to unlock. Now the expectation is someone to call Disable first.
>>>
>>> Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which
>>>
>>>           if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>
>>> will catch.
>>>
>>>
>>>> Let's say  Disable() is called first, then the variable will be false, right?
>>>
>>> Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW.
>>
>> Exactly.
> 
> Turns out that's not the end of that rabbit (side-)hole yet. :)
> 
> amdgpu_device_init initializes adev->gfx.gfx_off_req_count = 1. amdgpu_gfx_off_ctrl is then called with enable=true from amdgpu_device_init → amdgpu_device_ip_late_init → amdgpu_device_set_pg_state. This schedules amdgpu_device_delay_enable_gfx_off, which runs ~100ms later, enables GFXOFF in the HW and sets adev->gfx.gfx_off_state = true.
> 

What if a disable comes in within the first 100 ms? That is quite unlikely;
nevertheless, in that case the pending work gets cancelled and the variable
won't be set until the work gets a chance to run fully. The assumption that
GFXOFF is disabled after a subsequent amdgpu_gfx_off_ctrl call with
enable=false won't be correct, as PMFW will by default enable GFXOFF when
there is no activity.

Otherwise, we keep the assumption that amdgpu_device_delay_enable_gfx_off
gets a chance to run before any disable call comes in; maybe that holds in
most cases.
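
To make the concern concrete, here is a small user-space model of the two orderings. It is a sketch, not driver code: struct gfxoff_sim and the sim_* helpers are hypothetical, hw_gfxoff models the assumption that firmware already allows GFXOFF, and the disable path is simplified to always succeed.

```c
#include <stdbool.h>

/* Hypothetical model; field names are illustrative, not the driver's. */
struct gfxoff_sim {
    unsigned int req_count; /* models gfx_off_req_count */
    bool off_state;         /* what the driver believes (gfx_off_state) */
    bool work_pending;      /* delayed enable work queued, not yet run */
    bool hw_gfxoff;         /* assumed real state, e.g. enabled by PMFW */
};

/* State right after the late-init enable call: count 0, work queued,
 * and (by assumption) firmware already allowing GFXOFF. */
static struct gfxoff_sim sim_after_late_init(void)
{
    struct gfxoff_sim s = { 0, false, true, true };

    return s;
}

/* Delayed work body: enables GFXOFF and records it in off_state. */
static void sim_run_work(struct gfxoff_sim *s)
{
    if (s->work_pending) {
        s->work_pending = false;
        s->hw_gfxoff = true;
        s->off_state = true;
    }
}

/* Disable path: cancel the work on the 0 -> 1 transition and touch the
 * hardware only if the driver believes GFXOFF is currently enabled. */
static void sim_disable(struct gfxoff_sim *s)
{
    if (s->req_count == 0) {
        s->work_pending = false;    /* cancel_delayed_work_sync */
        if (s->off_state) {
            s->hw_gfxoff = false;
            s->off_state = false;
        }
    }
    s->req_count++;
}
```

If sim_disable runs before sim_run_work, hw_gfxoff stays true even though the caller asked for GFXOFF to be disabled. If the work runs first, the disable request clears it as expected.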

> So it looks fine as is actually, if a bit convoluted. 

> (I wonder if GFXOFF shouldn't rather be enabled synchronously during initialization though)

Yes, that is logical. But amdgpu_device_ip_late_init is also called
during amdgpu_device_resume, which is used for pm_ops or runtime PM. In
those cases it makes sense to delay it, as there could be an immediate
usage of GFX.

Thanks,
Lijo

> 
> 

* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17 11:49                   ` Lazar, Lijo
@ 2021-08-17 12:55                     ` Lazar, Lijo
  0 siblings, 0 replies; 49+ messages in thread
From: Lazar, Lijo @ 2021-08-17 12:55 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher, Christian König
  Cc: Leo Liu, James Zhu, amd-gfx, dri-devel



On 8/17/2021 5:19 PM, Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 4:36 PM, Michel Dänzer wrote:
>> On 2021-08-17 12:37 p.m., Lazar, Lijo wrote:
>>>
>>>
>>> On 8/17/2021 3:29 PM, Michel Dänzer wrote:
>>>> On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
>>>>>
>>>>>
>>>>> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>>>>>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>>>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>>
>>>>>>>> schedule_delayed_work does not push back the work if it was already
>>>>>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>>>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>>>>>> was disabled and re-enabled again during those 100 ms.
>>>>>>>>
>>>>>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>>>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>>>>>> disabling it again (for getting the GPU clock counter).
>>>>>>>>
>>>>>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>>>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>>>>>> reverse transition, not if the disable count was already 0. This makes
>>>>>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>>>>>> be lock-free.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>>>>>       mod_delayed_work.
>>>>>>>> v3:
>>>>>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>>>>>> v4:
>>>>>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>>>>>       adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>>>>>       checking for it to be 0 (Evan Quan)
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> # v3
>>>>>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>>>>>
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>>>>>  2 files changed, 30 insertions(+), 17 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>>>>>         struct amdgpu_device *adev =
>>>>>>>>                 container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>>>>>
>>>>>>>> -       mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>>> -       if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>>>> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>>> -                       adev->gfx.gfx_off_state = true;
>>>>>>>> -       }
>>>>>>>> -       mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>>>>>> +       WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>>>>>> +       WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>>>>>> +
>>>>>>>> +       if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>>> +               adev->gfx.gfx_off_state = true;
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  /**
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>>> index a0be0772c8b3..b4ced45301be 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>>>>>
>>>>>>>>         mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>>>>>
>>>>>>>> -       if (!enable)
>>>>>>>> -               adev->gfx.gfx_off_req_count++;
>>>>>>>> -       else if (adev->gfx.gfx_off_req_count > 0)
>>>>>>>> +       if (enable) {
>>>>>>>> +               /* If the count is already 0, it means there's an imbalance bug somewhere.
>>>>>>>> +                * Note that the bug may be in a different caller than the one which triggers the
>>>>>>>> +                * WARN_ON_ONCE.
>>>>>>>> +                */
>>>>>>>> +               if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>>>>> +                       goto unlock;
>>>>>>>> +
>>>>>>>>                 adev->gfx.gfx_off_req_count--;
>>>>>>>>
>>>>>>>> -       if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>>>>>> -               schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>>>> -       } else if (!enable && adev->gfx.gfx_off_state) {
>>>>>>>> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>>>>>> -                       adev->gfx.gfx_off_state = false;
>>>>>>>> +               if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>>>>>> +                       schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>>>>>> +       } else {
>>>>>>>> +               if (adev->gfx.gfx_off_req_count == 0) {
>>>>>>>> +                       cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>>>>>> +
>>>>>>>> +                       if (adev->gfx.gfx_off_state &&
>>>>>>>
>>>>>>> More of a question which I didn't check last time - Is this 
>>>>>>> expected to be true when the disable call comes in first?
>>>>>>
>>>>>> My assumption is that cancel_delayed_work_sync guarantees 
>>>>>> amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>>>>>
>>>>>
>>>>> To clarify - when nothing is scheduled. If enable() is called when 
>>>>> the count is 0, it goes to unlock. Now the expectation is for someone 
>>>>> to call disable() first.
>>>>
>>>> Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, 
>>>> or it's a bug, which
>>>>
>>>>           if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>>
>>>> will catch.
>>>>
>>>>
>>>>> Let's say disable() is called first, then the variable will be 
>>>>> false, right?
>>>>
>>>> Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is 
>>>> called with enable=false, adev->gfx.gfx_off_state == false (what it 
>>>> was initialized to), so it doesn't actually disable GFXOFF in HW.
>>>
>>> Exactly.
>>
>> Turns out that's not the end of that rabbit (side-)hole yet. :)
>>
>> amdgpu_device_init initializes adev->gfx.gfx_off_req_count = 1. 
>> amdgpu_gfx_off_ctrl is then called with enable=true from 
>> amdgpu_device_init → amdgpu_device_ip_late_init → 
>> amdgpu_device_set_pg_state. This schedules 
>> amdgpu_device_delay_enable_gfx_off, which runs ~100ms later, enables 
>> GFXOFF in the HW and sets adev->gfx.gfx_off_state = true.
>>
> 
> What if a disable comes at < 100ms? Quite unlikely; nevertheless, in that 
> case the pending work will get cancelled and the variable won't be set until 
> the work gets a chance to fully run. The assumption that GFXOFF disable 
> succeeded in a subsequent amdgpu_gfx_off_ctrl call with enable=false won't be 
> correct as PMFW will by default enable GFXOFF when there is no activity.

"PMFW will by default enable GFXOFF when there is no activity."
Checked again, and this is false at least for Sienna Cichlid/NV1x. The driver 
must explicitly allow GFXOFF first. In that sense, the driver doesn't need 
to disable GFXOFF unless it has succeeded in enabling it.

Overall, the existing logic is fine. Sorry for the confusion.
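
A minimal userspace sketch of that reasoning, for illustration only. It
models the patched amdgpu_gfx_off_ctrl without locking, assumes every SMU
call succeeds, and uses hypothetical names (gfx_model, model_gfx_off_ctrl,
model_run_delayed_work, hw_disables) that are not part of the driver:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; fields only loosely mirror adev->gfx.*. */
struct gfx_model {
    unsigned int req_count;   /* models gfx_off_req_count */
    bool off_state;           /* models gfx_off_state */
    bool work_pending;        /* models a queued gfx_off_delay_work */
    unsigned int hw_disables; /* times GFXOFF was disabled in the "HW" */
};

/* Models amdgpu_device_delay_enable_gfx_off firing after the delay. */
static void model_run_delayed_work(struct gfx_model *m)
{
    if (!m->work_pending)
        return;
    m->work_pending = false;

    /* Stand-ins for the two WARN_ON_ONCE checks in the patch. */
    assert(!m->off_state);
    assert(m->req_count == 0);

    m->off_state = true;      /* assumption: the SMU call succeeds */
}

/* Models the patched amdgpu_gfx_off_ctrl (mutex omitted). */
static void model_gfx_off_ctrl(struct gfx_model *m, bool enable)
{
    if (enable) {
        if (m->req_count == 0)
            return;           /* imbalance: the driver warns and bails out */

        m->req_count--;
        if (m->req_count == 0 && !m->off_state)
            m->work_pending = true;    /* schedule_delayed_work */
    } else {
        if (m->req_count == 0) {
            m->work_pending = false;   /* cancel_delayed_work_sync */

            /* Only touch the HW if GFXOFF was actually enabled. */
            if (m->off_state) {
                m->off_state = false;  /* assumption: SMU call succeeds */
                m->hw_disables++;
            }
        }
        m->req_count++;
    }
}
```

Starting from a request count of 1 (as amdgpu_device_init does), an enable
followed by a disable inside the delay window cancels the work and leaves
hw_disables at 0. Only after the work has run and set off_state does the
next disable reach the "HW".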

Thanks,
Lijo

> Otherwise, we keep the assumption that amdgpu_device_delay_enable_gfx_off 
> gets a chance to run before any disable call comes; that is probably 
> true in most cases.
> 
>> So it looks fine as is actually, if a bit convoluted. 
> 
>> (I wonder if GFXOFF shouldn't rather be enabled synchronously during 
>> initialization though)
> 
> Yes, that is logical. But amdgpu_device_ip_late_init is also called 
> during amdgpu_device_resume, which is used in pm_ops or runtime PM. 
> In those cases it makes sense to delay it, as there could be 
> an immediate usage of GFX.
> 
> Thanks,
> Lijo
> 
>>
>>


* Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
  2021-08-17  8:23     ` [PATCH] " Michel Dänzer
  2021-08-17  9:12       ` Lazar, Lijo
  2021-08-17  9:33       ` Quan, Evan
@ 2021-08-18 21:56       ` Alex Deucher
  2 siblings, 0 replies; 49+ messages in thread
From: Alex Deucher @ 2021-08-18 21:56 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Alex Deucher, Christian König, Leo Liu, James Zhu,
	amd-gfx list, Mailing list - DRI developers

Applied.  Let's see how long this one lasts :)

Alex

On Tue, Aug 17, 2021 at 4:23 AM Michel Dänzer <michel@daenzer.net> wrote:
>
> From: Michel Dänzer <mdaenzer@redhat.com>
>
> schedule_delayed_work does not push back the work if it was already
> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
> was disabled and re-enabled again during those 100 ms.
>
> This resulted in frame drops / stutter with the upcoming mutter 41
> release on Navi 14, due to constantly enabling GFXOFF in the HW and
> disabling it again (for getting the GPU clock counter).
>
> To fix this, call cancel_delayed_work_sync when the disable count
> transitions from 0 to 1, and only schedule the delayed work on the
> reverse transition, not if the disable count was already 0. This makes
> sure the delayed work doesn't run at unexpected times, and allows it to
> be lock-free.
>
> v2:
> * Use cancel_delayed_work_sync & mutex_trylock instead of
>   mod_delayed_work.
> v3:
> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
> v4:
> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>   adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>   checking for it to be 0 (Evan Quan)
>
> Cc: stable@vger.kernel.org
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
> Acked-by: Christian König <christian.koenig@amd.com> # v3
> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
> ---
>
> Alex, probably best to wait a bit longer before picking this up. :)
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>  2 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f3fd5ec710b6..f944ed858f3e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2777,12 +2777,11 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>         struct amdgpu_device *adev =
>                 container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>
> -       mutex_lock(&adev->gfx.gfx_off_mutex);
> -       if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> -                       adev->gfx.gfx_off_state = true;
> -       }
> -       mutex_unlock(&adev->gfx.gfx_off_mutex);
> +       WARN_ON_ONCE(adev->gfx.gfx_off_state);
> +       WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
> +
> +       if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, true))
> +               adev->gfx.gfx_off_state = true;
>  }
>
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..b4ced45301be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>
>         mutex_lock(&adev->gfx.gfx_off_mutex);
>
> -       if (!enable)
> -               adev->gfx.gfx_off_req_count++;
> -       else if (adev->gfx.gfx_off_req_count > 0)
> +       if (enable) {
> +               /* If the count is already 0, it means there's an imbalance bug somewhere.
> +                * Note that the bug may be in a different caller than the one which triggers the
> +                * WARN_ON_ONCE.
> +                */
> +               if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
> +                       goto unlock;
> +
>                 adev->gfx.gfx_off_req_count--;
>
> -       if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
> -               schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -       } else if (!enable && adev->gfx.gfx_off_state) {
> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> -                       adev->gfx.gfx_off_state = false;
> +               if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
> +                       schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> +       } else {
> +               if (adev->gfx.gfx_off_req_count == 0) {
> +                       cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +                       if (adev->gfx.gfx_off_state &&
> +                           !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +                               adev->gfx.gfx_off_state = false;
>
> -                       if (adev->gfx.funcs->init_spm_golden) {
> -                               dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
> -                               amdgpu_gfx_init_spm_golden(adev);
> +                               if (adev->gfx.funcs->init_spm_golden) {
> +                                       dev_dbg(adev->dev,
> +                                               "GFXOFF is disabled, re-init SPM golden settings\n");
> +                                       amdgpu_gfx_init_spm_golden(adev);
> +                               }
>                         }
>                 }
> +
> +               adev->gfx.gfx_off_req_count++;
>         }
>
> +unlock:
>         mutex_unlock(&adev->gfx.gfx_off_mutex);
>  }
>
> --
> 2.32.0
>
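
The timing change described in the commit message can be sketched in
userspace as follows. This is only an illustration under stated
assumptions: DELAY_MS is taken from the "~100 ms" in the commit message,
the work is assumed never to fire between the listed re-enable times, and
timer_model, model_schedule, model_cancel, old_expiry and new_expiry are
hypothetical names, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

#define DELAY_MS 100  /* assumed, from the "~100 ms" in the commit message */

struct timer_model {
    bool pending;
    long expires_ms;
};

/* schedule_delayed_work semantics: a no-op if the work is already pending. */
static bool model_schedule(struct timer_model *t, long now_ms)
{
    if (t->pending)
        return false;
    t->pending = true;
    t->expires_ms = now_ms + DELAY_MS;
    return true;
}

/* cancel_delayed_work_sync semantics, minus waiting for a running callback. */
static bool model_cancel(struct timer_model *t)
{
    bool was_pending = t->pending;

    t->pending = false;
    return was_pending;
}

/* Old behaviour: each re-enable only schedules; a disable never cancels. */
static long old_expiry(const long *enable_ms, int n)
{
    struct timer_model t = { 0 };

    assert(n > 0);
    for (int i = 0; i < n; i++)
        model_schedule(&t, enable_ms[i]);
    return t.expires_ms;
}

/* Patched behaviour: the disable before each re-enable cancels first. */
static long new_expiry(const long *enable_ms, int n)
{
    struct timer_model t = { 0 };

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        model_cancel(&t);
        model_schedule(&t, enable_ms[i]);
    }
    return t.expires_ms;
}
```

With re-enables at 0, 30 and 60 ms, old_expiry yields 100 (measured from
the first re-enable) while new_expiry yields 160 (measured from the last).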


Thread overview: 49+ messages
2021-08-11 16:52 [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Michel Dänzer
2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
2021-08-11 20:34   ` Alex Deucher
2021-08-11 21:00     ` Zhu, James
2021-08-11 21:34   ` AW: " Koenig, Christian
2021-08-11 22:12     ` Zhu, James
2021-08-11 22:22       ` Zhu, James
2021-08-12  2:42     ` Quan, Evan
2021-08-12  5:55       ` AW: " Koenig, Christian
2021-08-12  8:11         ` Michel Dänzer
2021-08-12 11:33           ` Lazar, Lijo
2021-08-12 16:54             ` Michel Dänzer
2021-08-13  4:23               ` Lazar, Lijo
2021-08-13 10:31                 ` Michel Dänzer
2021-08-13 11:18                   ` Lazar, Lijo
2021-08-16  7:33           ` Christian König
2021-08-12  2:43 ` [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Quan, Evan
2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
2021-08-13 11:50   ` Lazar, Lijo
2021-08-13 13:34     ` Michel Dänzer
2021-08-13 14:14       ` Lazar, Lijo
2021-08-13 14:40         ` Michel Dänzer
2021-08-13 15:07           ` Lazar, Lijo
2021-08-13 16:00             ` Michel Dänzer
2021-08-16  4:13               ` Lazar, Lijo
2021-08-16 10:45                 ` Michel Dänzer
2021-08-16  7:38   ` Christian König
2021-08-16 10:38     ` Michel Dänzer
2021-08-16 10:20   ` Quan, Evan
2021-08-16 10:43     ` Michel Dänzer
2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
2021-08-16 11:33     ` Lazar, Lijo
2021-08-16 12:06       ` Christian König
2021-08-16 15:06         ` Michel Dänzer
2021-08-16 19:02           ` Alex Deucher
2021-08-17  7:51     ` Quan, Evan
2021-08-17  8:17       ` Lazar, Lijo
2021-08-17  8:35         ` Michel Dänzer
2021-08-17  8:23     ` [PATCH] " Michel Dänzer
2021-08-17  9:12       ` Lazar, Lijo
2021-08-17  9:26         ` Michel Dänzer
2021-08-17  9:37           ` Lazar, Lijo
2021-08-17  9:59             ` Michel Dänzer
2021-08-17 10:37               ` Lazar, Lijo
2021-08-17 11:06                 ` Michel Dänzer
2021-08-17 11:49                   ` Lazar, Lijo
2021-08-17 12:55                     ` Lazar, Lijo
2021-08-17  9:33       ` Quan, Evan
2021-08-18 21:56       ` Alex Deucher
