amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
@ 2020-06-29  7:11 Monk Liu
  2020-06-29  8:18 ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: Monk Liu @ 2020-06-29  7:11 UTC (permalink / raw)
  To: amd-gfx; +Cc: Monk Liu

From: pengzhou <PengJu.Zhou@amd.com>

issue:
originally we kickoff IB test asynchronously with driver's init, thus
the IB test may still running when the driver loading done (modprobe amdgpu done).
if we shutdown VM immediately after amdgpu driver loaded then GPU may
hang because the IB test is still running

fix:
make IB test synchronize with driver init thus it won't still running
when we shutdown the VM.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 457f5d2..4f54660 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	/* must succeed. */
 	amdgpu_ras_resume(adev);
 
-	queue_delayed_work(system_wq, &adev->delayed_init_work,
+	if (amdgpu_sriov_vf(adev)) {
+		r = amdgpu_ib_ring_tests(adev);
+		if (r) {
+			DRM_ERROR("ib ring test failed (%d).\n", r);
+			return r;
+		}
+	} else {
+		queue_delayed_work(system_wq, &adev->delayed_init_work,
 			   msecs_to_jiffies(AMDGPU_RESUME_MS));
+	}
 
 	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
 	if (r) {
@@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 	int r;
 
 	DRM_INFO("amdgpu: finishing device.\n");
-	flush_delayed_work(&adev->delayed_init_work);
+	if (!amdgpu_sriov_vf(adev))
+		flush_delayed_work(&adev->delayed_init_work);
 	adev->shutdown = true;
 
 	/* make sure IB test finished before entering exclusive mode
@@ -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 	if (fbcon)
 		amdgpu_fbdev_set_suspend(adev, 1);
 
-	cancel_delayed_work_sync(&adev->delayed_init_work);
+	if (!amdgpu_sriov_vf(adev))
+		cancel_delayed_work_sync(&adev->delayed_init_work);
 
 	if (!amdgpu_device_has_dc_support(adev)) {
 		/* turn off display hw */
@@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 	if (r)
 		return r;
 
-	queue_delayed_work(system_wq, &adev->delayed_init_work,
+	if (amdgpu_sriov_vf(adev)) {
+		r = amdgpu_ib_ring_tests(adev);
+		if (r) {
+			DRM_ERROR("ib ring test failed (%d).\n", r);
+			return r;
+		}
+	} else {
+		queue_delayed_work(system_wq, &adev->delayed_init_work,
 			   msecs_to_jiffies(AMDGPU_RESUME_MS));
+	}
 
 	if (!amdgpu_device_has_dc_support(adev)) {
 		/* pin cursors */
@@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
 		return r;
 
 	/* Make sure IB tests flushed */
-	flush_delayed_work(&adev->delayed_init_work);
+	if (!amdgpu_sriov_vf(adev))
+		flush_delayed_work(&adev->delayed_init_work);
 
 	/* blat the mode back in */
 	if (fbcon) {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
  2020-06-29  7:11 [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV Monk Liu
@ 2020-06-29  8:18 ` Christian König
  2020-06-29  9:03   ` Liu, Monk
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2020-06-29  8:18 UTC (permalink / raw)
  To: Monk Liu, amd-gfx

Am 29.06.20 um 09:11 schrieb Monk Liu:
> From: pengzhou <PengJu.Zhou@amd.com>
>
> issue:
> originally we kickoff IB test asynchronously with driver's init, thus
> the IB test may still running when the driver loading done (modprobe amdgpu done).
> if we shutdown VM immediately after amdgpu driver loaded then GPU may
> hang because the IB test is still running
>
> fix:
> make IB test synchronize with driver init thus it won't still running
> when we shutdown the VM.

We explicitly added the asynchronously IB test for SRIOV to make driver 
load faster. Why is that now a problem?

And why would it help when the VM shuts down? We cancel/flush the test 
during driver unload/suspend as well.

>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
>   1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 457f5d2..4f54660 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	/* must succeed. */
>   	amdgpu_ras_resume(adev);
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>   	if (r) {
> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>   	int r;
>   
>   	DRM_INFO("amdgpu: finishing device.\n");
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);

You can drop this change, flushing a work which was never scheduled is 
harmless.

>   	adev->shutdown = true;
>   
>   	/* make sure IB test finished before entering exclusive mode
> @@ -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (fbcon)
>   		amdgpu_fbdev_set_suspend(adev, 1);
>   
> -	cancel_delayed_work_sync(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work_sync(&adev->delayed_init_work);
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* turn off display hw */
> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   	if (r)
>   		return r;
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* pin cursors */
> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   		return r;
>   
>   	/* Make sure IB tests flushed */
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);
>   
>   	/* blat the mode back in */
>   	if (fbcon) {

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
  2020-06-29  8:18 ` Christian König
@ 2020-06-29  9:03   ` Liu, Monk
  2020-06-29  9:09     ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: Liu, Monk @ 2020-06-29  9:03 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx

>>> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
Well I didn't notice this change explicitly for SRIOV, at least from the code it is quite a normal change 


>>> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
If you do the sequence like: load amdgpu and shutdown VM immediately , you will hit problems .

Virtual Machine's shutdown won't be blocked if the IB test is on the fly , e.g.: you can do "init 0" right after "modprobe amdgpu" and the "amdgpu_device_fini" won't be called 
Thus the flushing on IB test won't happen so the IB test is still running with the VM already shutdown and lead to GPU crash (usually VCN/VCE crash on invalid memory address)

 
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Monday, June 29, 2020 4:18 PM
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV

Am 29.06.20 um 09:11 schrieb Monk Liu:
> From: pengzhou <PengJu.Zhou@amd.com>
>
> issue:
> originally we kickoff IB test asynchronously with driver's init, thus 
> the IB test may still running when the driver loading done (modprobe amdgpu done).
> if we shutdown VM immediately after amdgpu driver loaded then GPU may 
> hang because the IB test is still running
>
> fix:
> make IB test synchronize with driver init thus it won't still running 
> when we shutdown the VM.

We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?

And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.

>
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
>   1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 457f5d2..4f54660 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	/* must succeed. */
>   	amdgpu_ras_resume(adev);
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>   	if (r) {
> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>   	int r;
>   
>   	DRM_INFO("amdgpu: finishing device.\n");
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);

You can drop this change, flushing a work which was never scheduled is harmless.

>   	adev->shutdown = true;
>   
>   	/* make sure IB test finished before entering exclusive mode @@ 
> -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (fbcon)
>   		amdgpu_fbdev_set_suspend(adev, 1);
>   
> -	cancel_delayed_work_sync(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work_sync(&adev->delayed_init_work);
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* turn off display hw */
> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   	if (r)
>   		return r;
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* pin cursors */
> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   		return r;
>   
>   	/* Make sure IB tests flushed */
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);
>   
>   	/* blat the mode back in */
>   	if (fbcon) {

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
  2020-06-29  9:03   ` Liu, Monk
@ 2020-06-29  9:09     ` Christian König
  2020-06-29  9:34       ` Liu, Monk
  0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2020-06-29  9:09 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx

Am 29.06.20 um 11:03 schrieb Liu, Monk:
>>>> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
> Well I didn't notice this change explicitly for SRIOV, at least from the code it is quite a normal change

It's been a while, but if I remember correctly the original motivation 
was to reduce the time we spend in exclusive mode for SRIOV.

Of course bare metal quickly picket this up as well because it reduces 
the time we spend in modprobe.

>>>> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
> If you do the sequence like: load amdgpu and shutdown VM immediately , you will hit problems .
>
> Virtual Machine's shutdown won't be blocked if the IB test is on the fly , e.g.: you can do "init 0" right after "modprobe amdgpu" and the "amdgpu_device_fini" won't be called
> Thus the flushing on IB test won't happen so the IB test is still running with the VM already shutdown and lead to GPU crash (usually VCN/VCE crash on invalid memory address)

Yeah, that is rather obvious. We try to reduce the modprobe time, but 
this is rather harmful for this test case.

But this can happen with any work running on the GPU. Do we have a 
specific test case for this or why is it a problem?

I mean we should not change the driver just to make a strange QA test 
case work which otherwise never happens in practice.

If we really need this I suggest to just add "if (amdgpu_sriov_vf(adev)) 
flush_delayed_work(&adev->delayed_init_work);" at the end of 
amdgpu_device_init().

Thanks,
Christian.

>
>   
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, June 29, 2020 4:18 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
>
> Am 29.06.20 um 09:11 schrieb Monk Liu:
>> From: pengzhou <PengJu.Zhou@amd.com>
>>
>> issue:
>> originally we kickoff IB test asynchronously with driver's init, thus
>> the IB test may still running when the driver loading done (modprobe amdgpu done).
>> if we shutdown VM immediately after amdgpu driver loaded then GPU may
>> hang because the IB test is still running
>>
>> fix:
>> make IB test synchronize with driver init thus it won't still running
>> when we shutdown the VM.
> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
>
> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
>>    1 file changed, 24 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 457f5d2..4f54660 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>    	/* must succeed. */
>>    	amdgpu_ras_resume(adev);
>>    
>> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
>> +	if (amdgpu_sriov_vf(adev)) {
>> +		r = amdgpu_ib_ring_tests(adev);
>> +		if (r) {
>> +			DRM_ERROR("ib ring test failed (%d).\n", r);
>> +			return r;
>> +		}
>> +	} else {
>> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>>    			   msecs_to_jiffies(AMDGPU_RESUME_MS));
>> +	}
>>    
>>    	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>>    	if (r) {
>> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>    	int r;
>>    
>>    	DRM_INFO("amdgpu: finishing device.\n");
>> -	flush_delayed_work(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		flush_delayed_work(&adev->delayed_init_work);
> You can drop this change, flushing a work which was never scheduled is harmless.
>
>>    	adev->shutdown = true;
>>    
>>    	/* make sure IB test finished before entering exclusive mode @@
>> -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>>    	if (fbcon)
>>    		amdgpu_fbdev_set_suspend(adev, 1);
>>    
>> -	cancel_delayed_work_sync(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		cancel_delayed_work_sync(&adev->delayed_init_work);
>>    
>>    	if (!amdgpu_device_has_dc_support(adev)) {
>>    		/* turn off display hw */
>> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>    	if (r)
>>    		return r;
>>    
>> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
>> +	if (amdgpu_sriov_vf(adev)) {
>> +		r = amdgpu_ib_ring_tests(adev);
>> +		if (r) {
>> +			DRM_ERROR("ib ring test failed (%d).\n", r);
>> +			return r;
>> +		}
>> +	} else {
>> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>>    			   msecs_to_jiffies(AMDGPU_RESUME_MS));
>> +	}
>>    
>>    	if (!amdgpu_device_has_dc_support(adev)) {
>>    		/* pin cursors */
>> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>    		return r;
>>    
>>    	/* Make sure IB tests flushed */
>> -	flush_delayed_work(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		flush_delayed_work(&adev->delayed_init_work);
>>    
>>    	/* blat the mode back in */
>>    	if (fbcon) {
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV
  2020-06-29  9:09     ` Christian König
@ 2020-06-29  9:34       ` Liu, Monk
  0 siblings, 0 replies; 5+ messages in thread
From: Liu, Monk @ 2020-06-29  9:34 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx

Sounds doable as well

_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Monday, June 29, 2020 5:10 PM
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV

Am 29.06.20 um 11:03 schrieb Liu, Monk:
>>>> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
> Well I didn't notice this change explicitly for SRIOV, at least from 
> the code it is quite a normal change

It's been a while, but if I remember correctly the original motivation was to reduce the time we spend in exclusive mode for SRIOV.

Of course bare metal quickly picket this up as well because it reduces the time we spend in modprobe.

>>>> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
> If you do the sequence like: load amdgpu and shutdown VM immediately , you will hit problems .
>
> Virtual Machine's shutdown won't be blocked if the IB test is on the 
> fly , e.g.: you can do "init 0" right after "modprobe amdgpu" and the 
> "amdgpu_device_fini" won't be called Thus the flushing on IB test 
> won't happen so the IB test is still running with the VM already 
> shutdown and lead to GPU crash (usually VCN/VCE crash on invalid 
> memory address)

Yeah, that is rather obvious. We try to reduce the modprobe time, but this is rather harmful for this test case.

But this can happen with any work running on the GPU. Do we have a specific test case for this or why is it a problem?

I mean we should not change the driver just to make a strange QA test case work which otherwise never happens in practice.

If we really need this I suggest to just add "if (amdgpu_sriov_vf(adev)) flush_delayed_work(&adev->delayed_init_work);" at the end of amdgpu_device_init().

Thanks,
Christian.

>
>   
> _____________________________________
> Monk Liu|GPU Virtualization Team |AMD
>
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Monday, June 29, 2020 4:18 PM
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init 
> for SRIOV
>
> Am 29.06.20 um 09:11 schrieb Monk Liu:
>> From: pengzhou <PengJu.Zhou@amd.com>
>>
>> issue:
>> originally we kickoff IB test asynchronously with driver's init, thus 
>> the IB test may still running when the driver loading done (modprobe amdgpu done).
>> if we shutdown VM immediately after amdgpu driver loaded then GPU may 
>> hang because the IB test is still running
>>
>> fix:
>> make IB test synchronize with driver init thus it won't still running 
>> when we shutdown the VM.
> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
>
> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
>
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
>>    1 file changed, 24 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 457f5d2..4f54660 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>    	/* must succeed. */
>>    	amdgpu_ras_resume(adev);
>>    
>> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
>> +	if (amdgpu_sriov_vf(adev)) {
>> +		r = amdgpu_ib_ring_tests(adev);
>> +		if (r) {
>> +			DRM_ERROR("ib ring test failed (%d).\n", r);
>> +			return r;
>> +		}
>> +	} else {
>> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>>    			   msecs_to_jiffies(AMDGPU_RESUME_MS));
>> +	}
>>    
>>    	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>>    	if (r) {
>> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>    	int r;
>>    
>>    	DRM_INFO("amdgpu: finishing device.\n");
>> -	flush_delayed_work(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		flush_delayed_work(&adev->delayed_init_work);
> You can drop this change, flushing a work which was never scheduled is harmless.
>
>>    	adev->shutdown = true;
>>    
>>    	/* make sure IB test finished before entering exclusive mode @@
>> -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>>    	if (fbcon)
>>    		amdgpu_fbdev_set_suspend(adev, 1);
>>    
>> -	cancel_delayed_work_sync(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		cancel_delayed_work_sync(&adev->delayed_init_work);
>>    
>>    	if (!amdgpu_device_has_dc_support(adev)) {
>>    		/* turn off display hw */
>> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>    	if (r)
>>    		return r;
>>    
>> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
>> +	if (amdgpu_sriov_vf(adev)) {
>> +		r = amdgpu_ib_ring_tests(adev);
>> +		if (r) {
>> +			DRM_ERROR("ib ring test failed (%d).\n", r);
>> +			return r;
>> +		}
>> +	} else {
>> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>>    			   msecs_to_jiffies(AMDGPU_RESUME_MS));
>> +	}
>>    
>>    	if (!amdgpu_device_has_dc_support(adev)) {
>>    		/* pin cursors */
>> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>>    		return r;
>>    
>>    	/* Make sure IB tests flushed */
>> -	flush_delayed_work(&adev->delayed_init_work);
>> +	if (!amdgpu_sriov_vf(adev))
>> +		flush_delayed_work(&adev->delayed_init_work);
>>    
>>    	/* blat the mode back in */
>>    	if (fbcon) {
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CMo
> nk.Liu%40amd.com%7Ce2c5bd13d7ce4260907108d81c0c322d%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637290185993763355&amp;sdata=cTl61%2BnFm3KM
> q8U7p0gsem7TTiAy8iZuVbYymwJScjo%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-29  9:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-29  7:11 [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV Monk Liu
2020-06-29  8:18 ` Christian König
2020-06-29  9:03   ` Liu, Monk
2020-06-29  9:09     ` Christian König
2020-06-29  9:34       ` Liu, Monk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).