All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
@ 2022-10-06 18:08 Bokun Zhang
  2022-10-06 18:11 ` Zhang, Bokun
  0 siblings, 1 reply; 5+ messages in thread
From: Bokun Zhang @ 2022-10-06 18:08 UTC (permalink / raw)
  To: amd-gfx; +Cc: Bokun Zhang

- Under SRIOV, SDMA engine is shared between VFs. Therefore,
  we will not stop SDMA during hw_fini. This is not an issue
  with normal dirver loading and unloading.

- However, when we put the SDMA engine to suspend state and resume
  it, the issue starts to show up. Something could attempt to use
  that SDMA engine to clear or move memory before the engine is
  initialized since the DRM entity is still there.

- Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
  and if we are under SRIOV, we will call sdma_v5_2_enable(true)
  afterwards to allow other VFs to use SDMA. This way, the DRM
  entity of SDMA engine is emptied and it will follow the flow
  of resume code path.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index f136fec7b4f4..3eaf1a573e73 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	if (amdgpu_sriov_vf(adev))
-		return 0;
-
+	/*
+	 * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
+	 * However, we still need to clean up the DRM entity
+	 * Therefore, we will re-enable SDMA afterwards.
+	 */
 	sdma_v5_2_ctx_switch_enable(adev, false);
 	sdma_v5_2_enable(adev, false);
 
+	if (amdgpu_sriov_vf(adev)) {
+		sdma_v5_2_enable(adev, true);
+		sdma_v5_2_ctx_switch_enable(adev, true);
+	}
+
 	return 0;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
  2022-10-06 18:08 [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV Bokun Zhang
@ 2022-10-06 18:11 ` Zhang, Bokun
  2022-10-06 19:56   ` Alex Deucher
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang, Bokun @ 2022-10-06 18:11 UTC (permalink / raw)
  To: Liu, Monk, Deucher, Alexander, Deng, Emily, Koenig, Christian; +Cc: amd-gfx

[AMD Official Use Only - General]

Hey guys,
    Please help review this patch for the suspend and resume issue.
    I have tested it with multi-VF environment, I think it is ok.

Thanks!

-----Original Message-----
From: Bokun Zhang <Bokun.Zhang@amd.com> 
Sent: Thursday, October 6, 2022 2:09 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Bokun <Bokun.Zhang@amd.com>
Subject: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV

- Under SRIOV, SDMA engine is shared between VFs. Therefore,
  we will not stop SDMA during hw_fini. This is not an issue
  with normal dirver loading and unloading.

- However, when we put the SDMA engine to suspend state and resume
  it, the issue starts to show up. Something could attempt to use
  that SDMA engine to clear or move memory before the engine is
  initialized since the DRM entity is still there.

- Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
  and if we are under SRIOV, we will call sdma_v5_2_enable(true)
  afterwards to allow other VFs to use SDMA. This way, the DRM
  entity of SDMA engine is emptied and it will follow the flow
  of resume code path.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index f136fec7b4f4..3eaf1a573e73 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)  {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	if (amdgpu_sriov_vf(adev))
-		return 0;
-
+	/*
+	 * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
+	 * However, we still need to clean up the DRM entity
+	 * Therefore, we will re-enable SDMA afterwards.
+	 */
 	sdma_v5_2_ctx_switch_enable(adev, false);
 	sdma_v5_2_enable(adev, false);
 
+	if (amdgpu_sriov_vf(adev)) {
+		sdma_v5_2_enable(adev, true);
+		sdma_v5_2_ctx_switch_enable(adev, true);
+	}
+
 	return 0;
 }
 
--
2.34.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
  2022-10-06 18:11 ` Zhang, Bokun
@ 2022-10-06 19:56   ` Alex Deucher
  2022-10-07 21:38     ` Zhang, Bokun
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Deucher @ 2022-10-06 19:56 UTC (permalink / raw)
  To: Zhang, Bokun
  Cc: Deucher, Alexander, Deng, Emily, amd-gfx, Koenig, Christian, Liu, Monk

[-- Attachment #1: Type: text/plain, Size: 2517 bytes --]

On Thu, Oct 6, 2022 at 2:11 PM Zhang, Bokun <Bokun.Zhang@amd.com> wrote:
>
> [AMD Official Use Only - General]
>
> Hey guys,
>     Please help review this patch for the suspend and resume issue.
>     I have tested it with multi-VF environment, I think it is ok.

Seems a little hacky, but I think that's the least intrusive for
stable.  How about the attached patches?

Alex


>
> Thanks!
>
> -----Original Message-----
> From: Bokun Zhang <Bokun.Zhang@amd.com>
> Sent: Thursday, October 6, 2022 2:09 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Bokun <Bokun.Zhang@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
>
> - Under SRIOV, SDMA engine is shared between VFs. Therefore,
>   we will not stop SDMA during hw_fini. This is not an issue
>   with normal dirver loading and unloading.
>
> - However, when we put the SDMA engine to suspend state and resume
>   it, the issue starts to show up. Something could attempt to use
>   that SDMA engine to clear or move memory before the engine is
>   initialized since the DRM entity is still there.
>
> - Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
>   and if we are under SRIOV, we will call sdma_v5_2_enable(true)
>   afterwards to allow other VFs to use SDMA. This way, the DRM
>   entity of SDMA engine is emptied and it will follow the flow
>   of resume code path.
>
> Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index f136fec7b4f4..3eaf1a573e73 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> -       if (amdgpu_sriov_vf(adev))
> -               return 0;
> -
> +       /*
> +        * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
> +        * However, we still need to clean up the DRM entity
> +        * Therefore, we will re-enable SDMA afterwards.
> +        */
>         sdma_v5_2_ctx_switch_enable(adev, false);
>         sdma_v5_2_enable(adev, false);
>
> +       if (amdgpu_sriov_vf(adev)) {
> +               sdma_v5_2_enable(adev, true);
> +               sdma_v5_2_ctx_switch_enable(adev, true);
> +       }
> +
>         return 0;
>  }
>
> --
> 2.34.1

[-- Attachment #2: 0002-drm-amdgpu-switch-sdma-buffer-function-tear-down-to-.patch --]
[-- Type: text/x-patch, Size: 10649 bytes --]

From f83007b59e5a612dffef4235d7537fab9aba489c Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@amd.com>
Date: Thu, 6 Oct 2022 15:31:40 -0400
Subject: [PATCH 2/3] drm/amdgpu: switch sdma buffer function tear down to a
 helper

Switch all of the SDMA implementations to use the helper to
tear down the ttm buffer manager.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 21 +++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/cik_sdma.c    |  6 +-----
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c   |  6 +-----
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c   |  6 +-----
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 24 +++++-------------------
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   |  6 +-----
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 10 +---------
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   |  9 +--------
 drivers/gpu/drm/amd/amdgpu/si_dma.c      |  5 ++---
 10 files changed, 36 insertions(+), 59 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index 43cf8632cc1a..ea5278f094c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -285,3 +285,24 @@ int amdgpu_sdma_init_microcode(struct amdgpu_device *adev,
 	}
 	return err;
 }
+
+void amdgpu_sdma_unset_buffer_funcs_helper(struct amdgpu_device *adev)
+{
+	struct amdgpu_ring *sdma;
+	int i;
+
+	for (i = 0; i < adev->sdma.num_instances; i++) {
+		if (adev->sdma.has_page_queue) {
+			sdma = &adev->sdma.instance[i].page;
+			if (adev->mman.buffer_funcs_ring == sdma) {
+				amdgpu_ttm_set_buffer_funcs_status(adev, false);
+				break;
+			}
+		}
+		sdma = &adev->sdma.instance[i].ring;
+		if (adev->mman.buffer_funcs_ring == sdma) {
+			amdgpu_ttm_set_buffer_funcs_status(adev, false);
+			break;
+		}
+	}
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index d2d88279fefb..7d99205c2e01 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -128,4 +128,6 @@ int amdgpu_sdma_init_microcode(struct amdgpu_device *adev,
         char *fw_name, u32 instance, bool duplicate);
 void amdgpu_sdma_destroy_inst_ctx(struct amdgpu_device *adev,
         bool duplicate);
+void amdgpu_sdma_unset_buffer_funcs_helper(struct amdgpu_device *adev);
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
index 5647f13b98d4..cbca9866645c 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
@@ -309,14 +309,10 @@ static void cik_sdma_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 seq
  */
 static void cik_sdma_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
 	u32 rb_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1))
-			amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32(mmSDMA0_GFX_RB_CNTL + sdma_offsets[i]);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
index 6bdffdc1c0b9..c52d246a1d96 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
@@ -342,14 +342,10 @@ static void sdma_v2_4_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v2_4_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
 	u32 rb_cntl, ib_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1))
-		amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32(mmSDMA0_GFX_RB_CNTL + sdma_offsets[i]);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
index 2584fa3cb13e..486d9b5c1b9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
@@ -516,14 +516,10 @@ static void sdma_v3_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v3_0_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
 	u32 rb_cntl, ib_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1))
-		amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32(mmSDMA0_GFX_RB_CNTL + sdma_offsets[i]);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 7241a9fb0121..7b4195f18a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -915,18 +915,12 @@ static void sdma_v4_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v4_0_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma[AMDGPU_MAX_SDMA_INSTANCES];
 	u32 rb_cntl, ib_cntl;
-	int i, unset = 0;
+	int i;
 
-	for (i = 0; i < adev->sdma.num_instances; i++) {
-		sdma[i] = &adev->sdma.instance[i].ring;
-
-		if ((adev->mman.buffer_funcs_ring == sdma[i]) && unset != 1) {
-			amdgpu_ttm_set_buffer_funcs_status(adev, false);
-			unset = 1;
-		}
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
+	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32_SDMA(i, mmSDMA0_GFX_RB_CNTL);
 		rb_cntl = REG_SET_FIELD(rb_cntl, SDMA0_GFX_RB_CNTL, RB_ENABLE, 0);
 		WREG32_SDMA(i, mmSDMA0_GFX_RB_CNTL, rb_cntl);
@@ -957,20 +951,12 @@ static void sdma_v4_0_rlc_stop(struct amdgpu_device *adev)
  */
 static void sdma_v4_0_page_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma[AMDGPU_MAX_SDMA_INSTANCES];
 	u32 rb_cntl, ib_cntl;
 	int i;
-	bool unset = false;
 
-	for (i = 0; i < adev->sdma.num_instances; i++) {
-		sdma[i] = &adev->sdma.instance[i].page;
-
-		if ((adev->mman.buffer_funcs_ring == sdma[i]) &&
-			(!unset)) {
-			amdgpu_ttm_set_buffer_funcs_status(adev, false);
-			unset = true;
-		}
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
+	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32_SDMA(i, mmSDMA0_PAGE_RB_CNTL);
 		rb_cntl = REG_SET_FIELD(rb_cntl, SDMA0_PAGE_RB_CNTL,
 					RB_ENABLE, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index c05c3eebde4c..783048e1b0ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -584,14 +584,10 @@ static void sdma_v5_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v5_0_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
 	u32 rb_cntl, ib_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1))
-		amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32_SOC15_IP(GC, sdma_v5_0_get_reg_offset(adev, i, mmSDMA0_GFX_RB_CNTL));
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 3eaf1a573e73..c2ee53c2dd1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -414,18 +414,10 @@ static void sdma_v5_2_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v5_2_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
-	struct amdgpu_ring *sdma2 = &adev->sdma.instance[2].ring;
-	struct amdgpu_ring *sdma3 = &adev->sdma.instance[3].ring;
 	u32 rb_cntl, ib_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1) ||
-	    (adev->mman.buffer_funcs_ring == sdma2) ||
-	    (adev->mman.buffer_funcs_ring == sdma3))
-		amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32_SOC15_IP(GC, sdma_v5_2_get_reg_offset(adev, i, mmSDMA0_GFX_RB_CNTL));
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 0150f66a5ae6..a6483483404e 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -398,14 +398,10 @@ static void sdma_v6_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr, u64 se
  */
 static void sdma_v6_0_gfx_stop(struct amdgpu_device *adev)
 {
-	struct amdgpu_ring *sdma0 = &adev->sdma.instance[0].ring;
-	struct amdgpu_ring *sdma1 = &adev->sdma.instance[1].ring;
 	u32 rb_cntl, ib_cntl;
 	int i;
 
-	if ((adev->mman.buffer_funcs_ring == sdma0) ||
-	    (adev->mman.buffer_funcs_ring == sdma1))
-		amdgpu_ttm_set_buffer_funcs_status(adev, false);
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		rb_cntl = RREG32_SOC15_IP(GC, sdma_v6_0_get_reg_offset(adev, i, regSDMA0_QUEUE0_RB_CNTL));
@@ -415,9 +411,6 @@ static void sdma_v6_0_gfx_stop(struct amdgpu_device *adev)
 		ib_cntl = REG_SET_FIELD(ib_cntl, SDMA0_QUEUE0_IB_CNTL, IB_ENABLE, 0);
 		WREG32_SOC15_IP(GC, sdma_v6_0_get_reg_offset(adev, i, regSDMA0_QUEUE0_IB_CNTL), ib_cntl);
 	}
-
-	sdma0->sched.ready = false;
-	sdma1->sched.ready = false;
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/si_dma.c b/drivers/gpu/drm/amd/amdgpu/si_dma.c
index f675111ace20..4d5e718540aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/si_dma.c
+++ b/drivers/gpu/drm/amd/amdgpu/si_dma.c
@@ -116,15 +116,14 @@ static void si_dma_stop(struct amdgpu_device *adev)
 	u32 rb_cntl;
 	unsigned i;
 
+	amdgpu_sdma_unset_buffer_funcs_helper(adev);
+
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		ring = &adev->sdma.instance[i].ring;
 		/* dma0 */
 		rb_cntl = RREG32(DMA_RB_CNTL + sdma_offsets[i]);
 		rb_cntl &= ~DMA_RB_ENABLE;
 		WREG32(DMA_RB_CNTL + sdma_offsets[i], rb_cntl);
-
-		if (adev->mman.buffer_funcs_ring == ring)
-			amdgpu_ttm_set_buffer_funcs_status(adev, false);
 	}
 }
 
-- 
2.37.3


[-- Attachment #3: 0001-drm-amdgpu-Fix-SDMA-engine-resume-issue-under-SRIOV.patch --]
[-- Type: text/x-patch, Size: 1945 bytes --]

From b4892a605115699a8ba9f0c59d2e33e372cbf436 Mon Sep 17 00:00:00 2001
From: Bokun Zhang <Bokun.Zhang@amd.com>
Date: Fri, 7 Oct 2022 02:08:38 +0800
Subject: [PATCH 1/3] drm/amdgpu: Fix SDMA engine resume issue under SRIOV

- Under SRIOV, SDMA engine is shared between VFs. Therefore,
  we will not stop SDMA during hw_fini. This is not an issue
  with normal dirver loading and unloading.

- However, when we put the SDMA engine to suspend state and resume
  it, the issue starts to show up. Something could attempt to use
  that SDMA engine to clear or move memory before the engine is
  initialized since the DRM entity is still there.

- Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
  and if we are under SRIOV, we will call sdma_v5_2_enable(true)
  afterwards to allow other VFs to use SDMA. This way, the DRM
  entity of SDMA engine is emptied and it will follow the flow
  of resume code path.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index f136fec7b4f4..3eaf1a573e73 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	if (amdgpu_sriov_vf(adev))
-		return 0;
-
+	/*
+	 * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
+	 * However, we still need to clean up the DRM entity
+	 * Therefore, we will re-enable SDMA afterwards.
+	 */
 	sdma_v5_2_ctx_switch_enable(adev, false);
 	sdma_v5_2_enable(adev, false);
 
+	if (amdgpu_sriov_vf(adev)) {
+		sdma_v5_2_enable(adev, true);
+		sdma_v5_2_ctx_switch_enable(adev, true);
+	}
+
 	return 0;
 }
 
-- 
2.37.3


[-- Attachment #4: 0003-drm-amdgpu-fix-SDMA-suspend-resume-on-SR-IOV.patch --]
[-- Type: text/x-patch, Size: 3426 bytes --]

From 8e6ff3180ab22e96d789e8e38084a5a84ceee533 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@amd.com>
Date: Thu, 6 Oct 2022 15:53:10 -0400
Subject: [PATCH 3/3] drm/amdgpu: fix SDMA suspend/resume on SR-IOV

Update all SDMA versions that support SR-IOV to properly
tear down the ttm buffer functions on suspend.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 16 ++++++----------
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c |  5 ++++-
 4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 7b4195f18a7c..298fa11702e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1940,8 +1940,11 @@ static int sdma_v4_0_hw_fini(void *handle)
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 	int i;
 
-	if (amdgpu_sriov_vf(adev))
+	if (amdgpu_sriov_vf(adev)) {
+		/* disable the scheduler for SDMA */
+		amdgpu_sdma_unset_buffer_funcs_helper(adev);
 		return 0;
+	}
 
 	for (i = 0; i < adev->sdma.num_instances; i++) {
 		amdgpu_irq_put(adev, &adev->sdma.ecc_irq,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index 783048e1b0ce..d4d9f196db83 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -1456,8 +1456,11 @@ static int sdma_v5_0_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	if (amdgpu_sriov_vf(adev))
+	if (amdgpu_sriov_vf(adev)) {
+		/* disable the scheduler for SDMA */
+		amdgpu_sdma_unset_buffer_funcs_helper(adev);
 		return 0;
+	}
 
 	sdma_v5_0_ctx_switch_enable(adev, false);
 	sdma_v5_0_enable(adev, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index c2ee53c2dd1b..809eca54fc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1349,19 +1349,15 @@ static int sdma_v5_2_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	/*
-	 * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
-	 * However, we still need to clean up the DRM entity
-	 * Therefore, we will re-enable SDMA afterwards.
-	 */
-	sdma_v5_2_ctx_switch_enable(adev, false);
-	sdma_v5_2_enable(adev, false);
-
 	if (amdgpu_sriov_vf(adev)) {
-		sdma_v5_2_enable(adev, true);
-		sdma_v5_2_ctx_switch_enable(adev, true);
+		/* disable the scheduler for SDMA */
+		amdgpu_sdma_unset_buffer_funcs_helper(adev);
+		return 0;
 	}
 
+	sdma_v5_2_ctx_switch_enable(adev, false);
+	sdma_v5_2_enable(adev, false);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index a6483483404e..da3beb0bf2fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1311,8 +1311,11 @@ static int sdma_v6_0_hw_fini(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
-	if (amdgpu_sriov_vf(adev))
+	if (amdgpu_sriov_vf(adev)) {
+		/* disable the scheduler for SDMA */
+		amdgpu_sdma_unset_buffer_funcs_helper(adev);
 		return 0;
+	}
 
 	sdma_v6_0_ctx_switch_enable(adev, false);
 	sdma_v6_0_enable(adev, false);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
  2022-10-06 19:56   ` Alex Deucher
@ 2022-10-07 21:38     ` Zhang, Bokun
  2022-10-12  1:43       ` Liu, Monk
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang, Bokun @ 2022-10-07 21:38 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx, Deng, Emily, Deucher, Alexander, Koenig,  Christian,
	Liu, Monk, Jiang, Jerry (SW)

[AMD Official Use Only - General]

Tested-by: Bokun, Zhang <Bokun.Zhang@amd.com>

This patch is better since it extracted the unset code and only execute it in the SRIOV routine.
I have tested it with multi-VF.

Thanks!


-----Original Message-----
From: Alex Deucher <alexdeucher@gmail.com> 
Sent: Thursday, October 6, 2022 3:56 PM
To: Zhang, Bokun <Bokun.Zhang@amd.com>
Cc: Liu, Monk <Monk.Liu@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV

On Thu, Oct 6, 2022 at 2:11 PM Zhang, Bokun <Bokun.Zhang@amd.com> wrote:
>
> [AMD Official Use Only - General]
>
> Hey guys,
>     Please help review this patch for the suspend and resume issue.
>     I have tested it with multi-VF environment, I think it is ok.

Seems a little hacky, but I think that's the least intrusive for stable.  How about the attached patches?

Alex


>
> Thanks!
>
> -----Original Message-----
> From: Bokun Zhang <Bokun.Zhang@amd.com>
> Sent: Thursday, October 6, 2022 2:09 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Bokun <Bokun.Zhang@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
>
> - Under SRIOV, SDMA engine is shared between VFs. Therefore,
>   we will not stop SDMA during hw_fini. This is not an issue
>   with normal dirver loading and unloading.
>
> - However, when we put the SDMA engine to suspend state and resume
>   it, the issue starts to show up. Something could attempt to use
>   that SDMA engine to clear or move memory before the engine is
>   initialized since the DRM entity is still there.
>
> - Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
>   and if we are under SRIOV, we will call sdma_v5_2_enable(true)
>   afterwards to allow other VFs to use SDMA. This way, the DRM
>   entity of SDMA engine is emptied and it will follow the flow
>   of resume code path.
>
> Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index f136fec7b4f4..3eaf1a573e73 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> -       if (amdgpu_sriov_vf(adev))
> -               return 0;
> -
> +       /*
> +        * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
> +        * However, we still need to clean up the DRM entity
> +        * Therefore, we will re-enable SDMA afterwards.
> +        */
>         sdma_v5_2_ctx_switch_enable(adev, false);
>         sdma_v5_2_enable(adev, false);
>
> +       if (amdgpu_sriov_vf(adev)) {
> +               sdma_v5_2_enable(adev, true);
> +               sdma_v5_2_ctx_switch_enable(adev, true);
> +       }
> +
>         return 0;
>  }
>
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
  2022-10-07 21:38     ` Zhang, Bokun
@ 2022-10-12  1:43       ` Liu, Monk
  0 siblings, 0 replies; 5+ messages in thread
From: Liu, Monk @ 2022-10-12  1:43 UTC (permalink / raw)
  To: Zhang, Bokun, Alex Deucher
  Cc: Deucher, Alexander, Deng, Emily, Koenig, Christian, amd-gfx,
	Jiang,  Jerry (SW)

[AMD Official Use Only - General]

Hi Bokun

Can you elaborate more on this reenabling SDMA engine during suspend ?
Why VF need the SDMA engine alive there ?

> -
> +       /*
> +        * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
> +        * However, we still need to clean up the DRM entity
> +        * Therefore, we will re-enable SDMA afterwards.
> +        */

Thanks 
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Solution | AMD
-------------------------------------------------------------------
we are hiring software manager for CVS core team
-------------------------------------------------------------------

-----Original Message-----
From: Zhang, Bokun <Bokun.Zhang@amd.com> 
Sent: 2022年10月8日 5:38
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Liu, Monk <Monk.Liu@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org; Jiang, Jerry (SW) <Jerry.Jiang@amd.com>
Subject: RE: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV

[AMD Official Use Only - General]

Tested-by: Bokun, Zhang <Bokun.Zhang@amd.com>

This patch is better since it extracted the unset code and only execute it in the SRIOV routine.
I have tested it with multi-VF.

Thanks!


-----Original Message-----
From: Alex Deucher <alexdeucher@gmail.com> 
Sent: Thursday, October 6, 2022 3:56 PM
To: Zhang, Bokun <Bokun.Zhang@amd.com>
Cc: Liu, Monk <Monk.Liu@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV

On Thu, Oct 6, 2022 at 2:11 PM Zhang, Bokun <Bokun.Zhang@amd.com> wrote:
>
> [AMD Official Use Only - General]
>
> Hey guys,
>     Please help review this patch for the suspend and resume issue.
>     I have tested it with multi-VF environment, I think it is ok.

Seems a little hacky, but I think that's the least intrusive for stable.  How about the attached patches?

Alex


>
> Thanks!
>
> -----Original Message-----
> From: Bokun Zhang <Bokun.Zhang@amd.com>
> Sent: Thursday, October 6, 2022 2:09 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Bokun <Bokun.Zhang@amd.com>
> Subject: [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV
>
> - Under SRIOV, SDMA engine is shared between VFs. Therefore,
>   we will not stop SDMA during hw_fini. This is not an issue
>   with normal dirver loading and unloading.
>
> - However, when we put the SDMA engine to suspend state and resume
>   it, the issue starts to show up. Something could attempt to use
>   that SDMA engine to clear or move memory before the engine is
>   initialized since the DRM entity is still there.
>
> - Therefore, we will call sdma_v5_2_enable(false) during hw_fini,
>   and if we are under SRIOV, we will call sdma_v5_2_enable(true)
>   afterwards to allow other VFs to use SDMA. This way, the DRM
>   entity of SDMA engine is emptied and it will follow the flow
>   of resume code path.
>
> Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c 
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index f136fec7b4f4..3eaf1a573e73 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -1357,12 +1357,19 @@ static int sdma_v5_2_hw_fini(void *handle)  {
>         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>
> -       if (amdgpu_sriov_vf(adev))
> -               return 0;
> -
> +       /*
> +        * Under SRIOV, the VF cannot single-mindedly stop SDMA engine
> +        * However, we still need to clean up the DRM entity
> +        * Therefore, we will re-enable SDMA afterwards.
> +        */
>         sdma_v5_2_ctx_switch_enable(adev, false);
>         sdma_v5_2_enable(adev, false);
>
> +       if (amdgpu_sriov_vf(adev)) {
> +               sdma_v5_2_enable(adev, true);
> +               sdma_v5_2_ctx_switch_enable(adev, true);
> +       }
> +
>         return 0;
>  }
>
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-10-12  1:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-06 18:08 [PATCH] drm/amdgpu: Fix SDMA engine resume issue under SRIOV Bokun Zhang
2022-10-06 18:11 ` Zhang, Bokun
2022-10-06 19:56   ` Alex Deucher
2022-10-07 21:38     ` Zhang, Bokun
2022-10-12  1:43       ` Liu, Monk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.