* [PATCH v2] drm/amd/amdgpu:Fix compute ring unable to detect hang.
@ 2019-09-19  7:08 Jesse Zhang
       [not found] ` <1568876935-18731-2-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jesse Zhang @ 2019-09-19  7:08 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and the user did not
+	 * set a custom value for the compute ring timeout, set the timeout
+	 * to the gfx ring timeout so that the compute ring can detect a
+	 * true hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+		(adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4
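
The check this patch adds to soc15_set_ip_blocks() can be modelled outside
the kernel. What follows is only a minimal userspace sketch: sketch_dev,
sketch_fixup_compute_timeout and SKETCH_NO_TIMEOUT are hypothetical names
introduced for illustration, with SKETCH_NO_TIMEOUT standing in for
MAX_SCHEDULE_TIMEOUT, and the two booleans standing in for the results of
amdgpu_sriov_vf() and amdgpu_passthrough().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_SCHEDULE_TIMEOUT ("never time out"). */
#define SKETCH_NO_TIMEOUT LONG_MAX

/* Hypothetical, trimmed-down model of the fields the patch touches. */
struct sketch_dev {
	bool sriov_vf;        /* models amdgpu_sriov_vf(adev) */
	bool passthrough;     /* models amdgpu_passthrough(adev) */
	long gfx_timeout;
	long compute_timeout;
};

/*
 * Mirror of the added condition: only when virtualized AND the compute
 * timeout still holds the "infinite" sentinel is it replaced by the gfx
 * timeout. Any other compute value is left alone.
 */
static void sketch_fixup_compute_timeout(struct sketch_dev *dev)
{
	assert(dev != NULL);

	if ((dev->sriov_vf || dev->passthrough) &&
	    dev->compute_timeout == SKETCH_NO_TIMEOUT)
		dev->compute_timeout = dev->gfx_timeout;
}
```

In this model the sentinel comparison cannot tell "the user set nothing"
apart from "the user explicitly asked for no compute timeout": both leave
the sentinel in place, so under SR-IOV or passthrough both end up with the
gfx value.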

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* [PATCH v3] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found] ` <1568876935-18731-2-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-19  7:58   ` Jesse Zhang
  2019-09-19  8:00   ` [PATCH v4] " Jesse Zhang
  1 sibling, 0 replies; 10+ messages in thread
From: Jesse Zhang @ 2019-09-19  7:58 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
 drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			 * For sriov case, always use the timeout
 			 * as gfx ring
 			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and the user did not
+	 * set a custom value for the compute ring timeout, set the timeout
+	 * to the gfx ring timeout so that the compute ring can detect a
+	 * true hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+		(adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4


* [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found] ` <1568876935-18731-2-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  2019-09-19  7:58   ` [PATCH v3] " Jesse Zhang
@ 2019-09-19  8:00   ` Jesse Zhang
       [not found]     ` <1568880041-19830-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Jesse Zhang @ 2019-09-19  8:00 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
 drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..963b6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			 * For sriov case, always use the timeout
 			 * as gfx ring
 			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 7c7e9f5..6cd5548 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
 	adev->rev_id = soc15_get_rev_id(adev);
 	adev->nbio.funcs->detect_hw_virt(adev);
 
+	/*
+	 * If running under SR-IOV or passthrough mode and the user did not
+	 * set a custom value for the compute ring timeout, set the timeout
+	 * to the gfx ring timeout so that the compute ring can detect a
+	 * true hang.
+	 */
+	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
+		(adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
+		adev->compute_timeout = adev->gfx_timeout;
+
 	if (amdgpu_sriov_vf(adev))
 		adev->virt.ops = &xgpu_ai_virt_ops;
 
-- 
2.7.4


* Re: [PATCH v4] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]     ` <1568880041-19830-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-19  8:14       ` Christian König
  2019-09-19 10:09       ` [PATCH v5] " Jesse Zhang
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2019-09-19  8:14 UTC (permalink / raw)
  To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 19.09.19 at 10:00, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  5 +----
>   drivers/gpu/drm/amd/amdgpu/soc15.c        | 10 ++++++++++
>   2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..963b6d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -468,10 +468,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>   			 * For sriov case, always use the timeout
>   			 * as gfx ring
>   			 */

Please also remove the comment since that is now stale.

Apart from that looks good to me,
Christian.

> -			if (!amdgpu_sriov_vf(ring->adev))
> -				timeout = adev->compute_timeout;
> -			else
> -				timeout = adev->gfx_timeout;
> +			timeout = adev->compute_timeout;
>   			break;
>   		case AMDGPU_RING_TYPE_SDMA:
>   			timeout = adev->sdma_timeout;
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 7c7e9f5..6cd5548 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -687,6 +687,16 @@ int soc15_set_ip_blocks(struct amdgpu_device *adev)
>   	adev->rev_id = soc15_get_rev_id(adev);
>   	adev->nbio.funcs->detect_hw_virt(adev);
>   
> +	/*
> +	 * If running under SR-IOV or passthrough mode and the user did not
> +	 * set a custom value for the compute ring timeout, set the timeout
> +	 * to the gfx ring timeout so that the compute ring can detect a
> +	 * true hang.
> +	 */
> +	if ((amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev)) &&
> +		(adev->compute_timeout == MAX_SCHEDULE_TIMEOUT))
> +		adev->compute_timeout = adev->gfx_timeout;
> +
>   	if (amdgpu_sriov_vf(adev))
>   		adev->virt.ops = &xgpu_ai_virt_ops;
>   


* [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]     ` <1568880041-19830-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  2019-09-19  8:14       ` Christian König
@ 2019-09-19 10:09       ` Jesse Zhang
       [not found]         ` <1568887741-1029-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  2019-09-20  2:36       ` [PATCH v6] " Jesse Zhang
  2019-09-20  2:38       ` [PATCH v7] " Jesse Zhang
  3 siblings, 1 reply; 10+ messages in thread
From: Jesse Zhang @ 2019-09-19 10:09 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+			if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+				adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
-- 
2.7.4
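
One detail of the amdgpu_drv.c hunk above is worth a close reading: the
added SR-IOV/passthrough check is indented as if it belonged to
"if (index == 1)", but no braces are added, so C attaches only the first
statement to that condition. The sketch below is a userspace illustration
of the difference, not driver code; sketch_timeouts and both function names
are hypothetical, and the thread does not say which behaviour was intended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the four per-ring timeouts. */
struct sketch_timeouts {
	long gfx, compute, sdma, video;
};

/* Control flow as the compiler sees the posted hunk (no braces). */
static void sketch_apply_as_posted(struct sketch_timeouts *t, int index,
				   bool virtualized)
{
	assert(t != NULL);

	if (index == 1)
		t->sdma = t->video = t->gfx;
	/* Runs for every index value, whatever the indentation suggests. */
	if (virtualized)
		t->compute = t->gfx;
}

/* Control flow the indentation suggests: both statements under index == 1. */
static void sketch_apply_braced(struct sketch_timeouts *t, int index,
				bool virtualized)
{
	assert(t != NULL);

	if (index == 1) {
		t->sdma = t->video = t->gfx;
		if (virtualized)
			t->compute = t->gfx;
	}
}
```

With more than one value parsed (index greater than 1) and a virtualized
device, the first variant overwrites a separately supplied compute value
with the gfx value, while the second leaves it untouched.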


* Re: [PATCH v5] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]         ` <1568887741-1029-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-19 12:12           ` Christian König
  0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2019-09-19 12:12 UTC (permalink / raw)
  To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 19.09.19 at 12:09, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
>   2 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>   
>   	amdgpu_device_check_block_size(adev);
>   
> -	ret = amdgpu_device_get_job_timeout_settings(adev);
> -	if (ret) {
> -		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> -		return ret;
> -	}
> -
>   	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>   
>   	return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> +	r = amdgpu_device_get_job_timeout_settings(adev);
> +	if (r) {
> +		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> +		return r;
> +	}
> +

I assume that you moved the code because the SRIOV/passthrough setting
was not yet available at the previous location?

But even with this change you can still remove the extra SRIOV check in
amdgpu_fence.c.

Regards,
Christian.

>   	/* doorbell bar mapping and doorbell index init*/
>   	amdgpu_device_doorbell_init(adev);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..1236245 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>   		}
>   		/*
>   		 * There is only one value specified and
> -		 * it should apply to all non-compute jobs.
> +		 * it should apply to all jobs.
>   		 */
>   		if (index == 1)
>   			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> +			if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> +				adev->compute_timeout = adev->gfx_timeout;
>   	}
>   
>   	return ret;


* [PATCH v6] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]     ` <1568880041-19830-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  2019-09-19  8:14       ` Christian König
  2019-09-19 10:09       ` [PATCH v5] " Jesse Zhang
@ 2019-09-20  2:36       ` Jesse Zhang
  2019-09-20  2:38       ` [PATCH v7] " Jesse Zhang
  3 siblings, 0 replies; 10+ messages in thread
From: Jesse Zhang @ 2019-09-20  2:36 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+			if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+				adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4


* [PATCH v7] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]     ` <1568880041-19830-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
                         ` (2 preceding siblings ...)
  2019-09-20  2:36       ` [PATCH v6] " Jesse Zhang
@ 2019-09-20  2:38       ` Jesse Zhang
       [not found]         ` <1568947109-5924-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  3 siblings, 1 reply; 10+ messages in thread
From: Jesse Zhang @ 2019-09-20  2:38 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..1236245 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1378,10 +1378,12 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 		}
 		/*
 		 * There is only one value specified and
-		 * it should apply to all non-compute jobs.
+		 * it should apply to all jobs.
 		 */
 		if (index == 1)
 			adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
+			if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+				adev->compute_timeout = adev->gfx_timeout;
 	}
 
 	return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4


* [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]         ` <1568947109-5924-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-20  6:57           ` Jesse Zhang
       [not found]             ` <1568962637-26150-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jesse Zhang @ 2019-09-20  6:57 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Jesse Zhang

When a compute fence does not signal, the compute ring cannot detect a
hardware hang because its timeout value is infinite by default.

In SR-IOV and passthrough mode, if the user does not declare a custom
timeout value for the compute ring, use the gfx ring timeout value as the
default, so that the compute ring can detect a true hardware hang.

Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  7 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3b5282b..03ac5a1da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
 
 	amdgpu_device_check_block_size(adev);
 
-	ret = amdgpu_device_get_job_timeout_settings(adev);
-	if (ret) {
-		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
-		return ret;
-	}
-
 	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
 
 	return ret;
@@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	r = amdgpu_device_get_job_timeout_settings(adev);
+	if (r) {
+		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
+		return r;
+	}
+
 	/* doorbell bar mapping and doorbell index init*/
 	amdgpu_device_doorbell_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420888e..98be49b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
 	/*
 	 * By default timeout for non compute jobs is 10000.
 	 * And there is no timeout enforced on compute jobs.
+	 * In SR-IOV or passthrough mode, the timeout for compute
+	 * jobs is 10000 by default.
 	 */
 	adev->gfx_timeout = msecs_to_jiffies(10000);
 	adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
-	adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
+	if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
+		adev->compute_timeout = adev->gfx_timeout;
+	else
+		adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
 
 	if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
 		while ((timeout_setting = strsep(&input, ",")) &&
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index cbcaa7c..9ef53ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 			timeout = adev->gfx_timeout;
 			break;
 		case AMDGPU_RING_TYPE_COMPUTE:
-			/*
-			 * For non-sriov case, no timeout enforce
-			 * on compute ring by default. Unless user
-			 * specifies a timeout for compute ring.
-			 *
-			 * For sriov case, always use the timeout
-			 * as gfx ring
-			 */
-			if (!amdgpu_sriov_vf(ring->adev))
-				timeout = adev->compute_timeout;
-			else
-				timeout = adev->gfx_timeout;
+			timeout = adev->compute_timeout;
 			break;
 		case AMDGPU_RING_TYPE_SDMA:
 			timeout = adev->sdma_timeout;
-- 
2.7.4
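
The v8 approach picks the compute default up front, before the
lockup_timeout string is parsed, so a value the user supplies for compute
can still replace it afterwards. A minimal userspace sketch of just the
default selection follows; sketch_timeouts, sketch_set_defaults and the
SKETCH_* macros are hypothetical names, and the values are kept in
milliseconds here rather than converted with msecs_to_jiffies().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_SCHEDULE_TIMEOUT ("never time out"). */
#define SKETCH_NO_TIMEOUT LONG_MAX
/* The 10000 ms default named in the patch comment. */
#define SKETCH_DEFAULT_MS 10000L

/* Hypothetical model of the four per-ring timeouts, in milliseconds. */
struct sketch_timeouts {
	long gfx_ms, compute_ms, sdma_ms, video_ms;
};

/*
 * Default selection as in the v8 hunk: non-compute rings always get the
 * 10000 ms default; compute gets it only under SR-IOV or passthrough and
 * is otherwise left without a timeout.
 */
static void sketch_set_defaults(struct sketch_timeouts *t, bool sriov_vf,
				bool passthrough)
{
	assert(t != NULL);

	t->gfx_ms = SKETCH_DEFAULT_MS;
	t->sdma_ms = t->video_ms = t->gfx_ms;
	if (sriov_vf || passthrough)
		t->compute_ms = t->gfx_ms;
	else
		t->compute_ms = SKETCH_NO_TIMEOUT;
}
```

Because the choice depends on the two virtualization checks, this step
only gives the right answer once those checks are meaningful, which is
consistent with the patch moving the call into amdgpu_device_init().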


* Re: [PATCH v8] drm/amd/amdgpu:Fix compute ring unable to detect hang.
       [not found]             ` <1568962637-26150-1-git-send-email-zhexi.zhang-5C7GfCeVMHo@public.gmane.org>
@ 2019-09-20 14:29               ` Christian König
  0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2019-09-20 14:29 UTC (permalink / raw)
  To: Jesse Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 20.09.19 at 08:57, Jesse Zhang wrote:
> When a compute fence does not signal, the compute ring cannot detect a
> hardware hang because its timeout value is infinite by default.
>
> In SR-IOV and passthrough mode, if the user does not declare a custom
> timeout value for the compute ring, use the gfx ring timeout value as the
> default, so that the compute ring can detect a true hardware hang.
>
> Change-Id: I794ec0868c6c0aad407749457260ecfee0617c10
> Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  7 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 13 +------------
>   3 files changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3b5282b..03ac5a1da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1024,12 +1024,6 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
>   
>   	amdgpu_device_check_block_size(adev);
>   
> -	ret = amdgpu_device_get_job_timeout_settings(adev);
> -	if (ret) {
> -		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> -		return ret;
> -	}
> -
>   	adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
>   
>   	return ret;
> @@ -2732,6 +2726,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> +	r = amdgpu_device_get_job_timeout_settings(adev);
> +	if (r) {
> +		dev_err(adev->dev, "invalid lockup_timeout parameter syntax\n");
> +		return r;
> +	}
> +
>   	/* doorbell bar mapping and doorbell index init*/
>   	amdgpu_device_doorbell_init(adev);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 420888e..98be49b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1338,10 +1338,15 @@ int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>   	/*
>   	 * By default timeout for non compute jobs is 10000.
>   	 * And there is no timeout enforced on compute jobs.
> +	 * In SR-IOV or passthrough mode, the timeout for compute
> +	 * jobs is 10000 by default.
>   	 */
>   	adev->gfx_timeout = msecs_to_jiffies(10000);
>   	adev->sdma_timeout = adev->video_timeout = adev->gfx_timeout;
> -	adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
> +	if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
> +		adev->compute_timeout = adev->gfx_timeout;
> +	else
> +		adev->compute_timeout = MAX_SCHEDULE_TIMEOUT;
>   
>   	if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENTH)) {
>   		while ((timeout_setting = strsep(&input, ",")) &&
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index cbcaa7c..9ef53ca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -460,18 +460,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>   			timeout = adev->gfx_timeout;
>   			break;
>   		case AMDGPU_RING_TYPE_COMPUTE:
> -			/*
> -			 * For non-sriov case, no timeout enforce
> -			 * on compute ring by default. Unless user
> -			 * specifies a timeout for compute ring.
> -			 *
> -			 * For sriov case, always use the timeout
> -			 * as gfx ring
> -			 */
> -			if (!amdgpu_sriov_vf(ring->adev))
> -				timeout = adev->compute_timeout;
> -			else
> -				timeout = adev->gfx_timeout;
> +			timeout = adev->compute_timeout;
>   			break;
>   		case AMDGPU_RING_TYPE_SDMA:
>   			timeout = adev->sdma_timeout;

