* [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
@ 2021-11-18 16:57 shaoyunl
  2021-11-19  3:07 ` Liu, Shaoyun
  2021-11-22 15:40 ` Felix Kuehling
  0 siblings, 2 replies; 6+ messages in thread
From: shaoyunl @ 2021-11-18 16:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: shaoyunl

For SRIOV XGMI configurations, the host driver handles the hive reset,
so on the guest side reset_sriov is called only once, on one device. This
leaves kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
already been moved out of the reset_sriov function. Move kfd post_reset out of
the reset_sriov function as well to keep them balanced.

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
-	amdgpu_amdkfd_post_reset(adev);
 
 error:
 	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
 	/* Actual ASIC resets if needed.*/
-	/* TODO Implement XGMI hive reset logic for SRIOV */
+	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, job ? false : true);
 		if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-		/* unlock kfd: SRIOV would do it separately */
-		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+		/* unlock kfd */
+		if (!need_emergency_restart)
 	                amdgpu_amdkfd_post_reset(tmp_adev);
 
 		/* kfd_post_reset will do nothing if kfd device is not initialized,
-- 
2.17.1
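
The imbalance described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: every `fake_*` name is hypothetical, and the per-device `kfd_lock_depth` counter is an assumed stand-in for whatever state kfd pre_reset/post_reset actually pair up. It assumes pre_reset runs for every device on the recovery list while reset_sriov runs once, on the first device.

```c
#include <assert.h>

/* Hypothetical stand-in for a device; not the real struct amdgpu_device. */
struct fake_dev {
	int kfd_lock_depth;	/* +1 on pre_reset, -1 on post_reset */
};

static void fake_kfd_pre_reset(struct fake_dev *d)
{
	d->kfd_lock_depth++;
}

static void fake_kfd_post_reset(struct fake_dev *d)
{
	d->kfd_lock_depth--;
}

/* Models reset_sriov(); with_post_reset mimics the call the patch removes. */
static void fake_reset_sriov(struct fake_dev *d, int with_post_reset)
{
	if (with_post_reset)
		fake_kfd_post_reset(d);
}

/*
 * Models the recovery path: pre_reset for every device on the list,
 * one reset_sriov() on the first device, then the final loop.
 * patched == 0: post_reset only inside reset_sriov (old behaviour).
 * patched == 1: post_reset in the final loop for every device.
 * Returns the number of devices left with a nonzero lock depth.
 */
static int fake_gpu_recover(struct fake_dev *devs, int n, int patched)
{
	int i, unbalanced = 0;

	assert(n > 0);

	for (i = 0; i < n; i++)
		fake_kfd_pre_reset(&devs[i]);

	fake_reset_sriov(&devs[0], !patched);

	for (i = 0; i < n; i++) {
		if (patched)
			fake_kfd_post_reset(&devs[i]);
		if (devs[i].kfd_lock_depth != 0)
			unbalanced++;
	}
	return unbalanced;
}
```

In this model, a four-device list leaves three devices unbalanced when `patched` is 0 and none when it is 1, which is the behaviour the patch aims for.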


* RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
  2021-11-18 16:57 [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function shaoyunl
@ 2021-11-19  3:07 ` Liu, Shaoyun
  2021-11-22 15:15   ` Liu, Shaoyun
  2021-11-22 15:40 ` Felix Kuehling
  1 sibling, 1 reply; 6+ messages in thread
From: Liu, Shaoyun @ 2021-11-19  3:07 UTC (permalink / raw)
  To: amd-gfx

[AMD Official Use Only]

Ping 


* RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
  2021-11-19  3:07 ` Liu, Shaoyun
@ 2021-11-22 15:15   ` Liu, Shaoyun
  0 siblings, 0 replies; 6+ messages in thread
From: Liu, Shaoyun @ 2021-11-22 15:15 UTC (permalink / raw)
  To: amd-gfx, Kuehling, Felix

[AMD Official Use Only]

ping


* Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
  2021-11-18 16:57 [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function shaoyunl
  2021-11-19  3:07 ` Liu, Shaoyun
@ 2021-11-22 15:40 ` Felix Kuehling
  2021-11-22 16:16   ` Liu, Shaoyun
  1 sibling, 1 reply; 6+ messages in thread
From: Felix Kuehling @ 2021-11-22 15:40 UTC (permalink / raw)
  To: shaoyunl, amd-gfx

On 2021-11-18 at 11:57 a.m., shaoyunl wrote:
> For SRIOV XGMI configurations, the host driver handles the hive reset,
> so on the guest side reset_sriov is called only once, on one device. This
> leaves kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
> already been moved out of the reset_sriov function. Move kfd post_reset out of
> the reset_sriov function as well to keep them balanced.
>
> Signed-off-by: shaoyunl <shaoyun.liu@amd.com>

Please change the headline prefix to "drm/amdgpu: ". The extra "/amd" is
redundant. And I'd also add a tag

Fixes: 9f4f2c1a3524 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in
sriov")

Note that the commit hash is the one from the drm-next branch, which is
what will get merged into master eventually. With those changes, the
patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>



* RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
  2021-11-22 15:40 ` Felix Kuehling
@ 2021-11-22 16:16   ` Liu, Shaoyun
  2021-11-22 17:41     ` Felix Kuehling
  0 siblings, 1 reply; 6+ messages in thread
From: Liu, Shaoyun @ 2021-11-22 16:16 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[AMD Official Use Only]

Thanks for the review.
The hash for the previous change on the gerritgit/amd-staging-drm-next branch is 7079e7d5c6bf248bff, so is there another drm-next branch for upstream that is not in gerritgit?

Thanks 
Shaoyun.liu



* Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
  2021-11-22 16:16   ` Liu, Shaoyun
@ 2021-11-22 17:41     ` Felix Kuehling
  0 siblings, 0 replies; 6+ messages in thread
From: Felix Kuehling @ 2021-11-22 17:41 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx

On 2021-11-22 at 11:16 a.m., Liu, Shaoyun wrote:
> [AMD Official Use Only]
>
> Thanks for the review.
> The hash for the previous change on the gerritgit/amd-staging-drm-next branch is 7079e7d5c6bf248bff, so is there another drm-next branch for upstream that is not in gerritgit?

Yes. amd-staging-drm-next is our AMD internal branch. Alex sends pull
requests to Dave Airlie for his drm-next branch, where they get
integrated with all the other DRM driver changes. That usually results
in different commit hashes.

Regards,
  Felix

